BigQuery TimePartitioning IT

2019-01-28 Thread Wout Scheepers
Hey all,

For my BigQuery clustering PR, I wrote an integration test for time
partitioning and clustering [1].
Can anyone create the “BigQueryTimePartitioningIT” dataset in the Beam testing
project (apache-beam-testing)? Or is dataset creation handled somehow in the
Gradle setup?
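
If it has to be created by hand, a minimal sketch with the google-cloud-bigquery
Java client would look something like this (assuming credentials that can write
to apache-beam-testing; the class name is just for illustration):

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.DatasetInfo;

    public class CreateItDataset {
      public static void main(String[] args) {
        // Assumes GOOGLE_APPLICATION_CREDENTIALS grants access to apache-beam-testing.
        BigQuery bigquery =
            BigQueryOptions.newBuilder()
                .setProjectId("apache-beam-testing")
                .build()
                .getService();
        bigquery.create(DatasetInfo.newBuilder("BigQueryTimePartitioningIT").build());
      }
    }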

Thanks,
Wout

[1] 
https://github.com/apache/beam/pull/7061/commits/2df3611a0c1afa4602b21dd655e85fae4f30200d#diff-00c26efe2f7d595ee31affd11a8cbce2





Re: Stand at FOSDEM 2019

2018-11-30 Thread Wout Scheepers
I’m based in Brussels and happy to help out.

Wout

From: Griselda Cuevas 
Reply-To: "dev@beam.apache.org" 
Date: Thursday, 29 November 2018 at 21:44
To: "dev@beam.apache.org" 
Subject: Re: Stand at FOSDEM 2019

+1 -- I'm happy to help with the merch, I'll be attending and will help staff 
the booth :)

G

On Thu, 29 Nov 2018 at 05:46, Suneel Marthi <smar...@apache.org> wrote:
+1

On Thu, Nov 29, 2018 at 6:14 AM Matthias Baetens <baetensmatth...@gmail.com> wrote:
Hey Max,

Great idea. I'd be very keen to join. I'll look at my calendar over the weekend 
to see if this would work.
Are you going yourself?

Cheers,
Matthias

On Thu, 29 Nov 2018 at 11:06 Maximilian Michels <m...@apache.org> wrote:
Hi,

For everyone who might be attending FOSDEM19: What do you think about
taking a slot for Beam at the Apache stand?

A slot is 2-3 hours. It is a great way to spread the word about Beam. We
wouldn't have to prepare much, just bring some merch.

There is still plenty of space:
https://cwiki.apache.org/confluence/display/COMDEV/FOSDEM+2019

Cheers,
Max

PS: FOSDEM is an open-source conference in Brussels, Feb 2-3, 2019
--



Re: BigqueryIO field clustering

2018-11-28 Thread Wout Scheepers
Hey all,

Almost two weeks ago, I created a PR to support BigQuery clustering [1].
Can someone please have a look?
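
For context, the intended usage once the PR lands looks roughly like this (a
sketch against the PR's proposed withClustering builder, so names may still
shift in review; rows is a PCollection<TableRow> from earlier in the pipeline):

    import com.google.api.services.bigquery.model.Clustering;
    import com.google.api.services.bigquery.model.TimePartitioning;
    import java.util.Arrays;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

    rows.apply(
        BigQueryIO.writeTableRows()
            .to("project:dataset.table")
            // Clustering requires a time-partitioned destination table.
            .withTimePartitioning(
                new TimePartitioning().setType("DAY").setField("publish_time"))
            .withClustering(new Clustering().setFields(Arrays.asList("clustering_id"))));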

Thanks,
Wout

1: https://github.com/apache/beam/pull/7061


From: Lukasz Cwik 
Reply-To: "u...@beam.apache.org" 
Date: Wednesday, 29 August 2018 at 18:32
To: dev , "u...@beam.apache.org" 
Cc: Bob De Schutter 
Subject: Re: BigqueryIO field clustering

+dev@beam.apache.org

Wout, I assigned this task to you since it seems like you're interested in 
contributing.
The Apache Beam contribution guide [1] is a good place to start for answering 
questions on how to contribute.

If you need help getting things reviewed or have questions, feel free to 
reach out on dev@beam.apache.org or on Slack.

1: https://beam.apache.org/contribute/


On Wed, Aug 29, 2018 at 1:28 AM Wout Scheepers 
<wout.scheep...@vente-exclusive.com> wrote:
Hey all,

I’m trying to use the field clustering beta feature in BigQuery [1].
However, the current Beam/Dataflow worker BigQuery API service dependency is 
‘com.google.apis:google-api-services-bigquery:v2-rev374-1.23.0’, which does 
not include the clustering option in the TimePartitioning class.
As a result, I can’t specify the clustering field when loading/streaming into 
BigQuery. See [2] for the BigQuery API error details.
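
To make the gap concrete, this is as far as the pinned client lets me go (a
sketch):

    import com.google.api.services.bigquery.model.TimePartitioning;

    TimePartitioning partitioning =
        new TimePartitioning().setType("DAY").setField("publish_time");
    // There is no setter on TimePartitioning in v2-rev374-1.23.0 through which
    // the clustering field(s) could be passed, so the spec above is all that
    // reaches BigQuery.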

Does anyone know a workaround for this?

I guess that in the worst case I’ll have to wait until Beam supports a newer 
version of the BigQuery API service. After checking the Beam JIRA, I found 
BEAM-5191 (https://jira.apache.org/jira/browse/BEAM-5191). Is there any way I 
can help to push this forward and make this feature possible in the near future?

Thanks in advance,
Wout

[1] https://cloud.google.com/bigquery/docs/clustered-tables
[2] "errorResult" : {
  "message" : "Incompatible table partitioning specification. Expects 
partitioning specification interval(type:day,field:publish_time) 
clustering(clustering_id), but input partitioning specification is 
interval(type:day,field:publish_time)",
  "reason" : "invalid"
}


Re: Wiki edit access

2018-11-19 Thread Wout Scheepers
Sorry, I assumed it would be the same account as the Apache JIRA one. I just 
created a new one.
Full name: “Wout Scheepers”
Email: woutscheep...@gmail.com


From: Lukasz Cwik 
Reply-To: "dev@beam.apache.org" 
Date: Friday, 16 November 2018 at 18:39
To: dev 
Subject: Re: Wiki edit access

I tried finding your account on cwiki.apache.org but was unable to. What is 
your user id on cwiki.apache.org?

On Thu, Nov 15, 2018 at 7:51 AM Wout Scheepers 
<wout.scheep...@vente-exclusive.com> wrote:
Can anyone give me edit access for the wiki?

Thanks,
Wout


Wiki edit access

2018-11-15 Thread Wout Scheepers
Can anyone give me edit access for the wiki?

Thanks,
Wout


Re: Bigquery streaming TableRow size limit

2018-11-15 Thread Wout Scheepers
Thanks for your thoughts.

Also, I’m doing something similar when streaming data into partitioned tables.
From [1]:
“When the data is streamed, data between 7 days in the past and 3 days in the 
future is placed in the streaming buffer, and then it is extracted to the 
corresponding partitions.”

I added a check to see if the event time is within this timebound. If not, a 
load job is triggered. This can happen when we replay old data.
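
The check itself is a small routing DoFn, roughly like this (the tags and names
are mine; the NEEDS_LOAD_JOB branch feeds the load job):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
    import org.apache.beam.sdk.values.TupleTag;
    import org.joda.time.Duration;
    import org.joda.time.Instant;

    // Rows whose event time falls outside BigQuery's streaming window for
    // partitioned tables (7 days back, 3 days ahead) go to a side output
    // that is written with a batch load job instead of streaming inserts.
    static final TupleTag<TableRow> STREAMABLE = new TupleTag<TableRow>() {};
    static final TupleTag<TableRow> NEEDS_LOAD_JOB = new TupleTag<TableRow>() {};

    static class RouteByEventTime extends DoFn<TableRow, TableRow> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Instant now = Instant.now();
        Instant eventTime = c.timestamp();
        if (eventTime.isBefore(now.minus(Duration.standardDays(7)))
            || eventTime.isAfter(now.plus(Duration.standardDays(3)))) {
          c.output(NEEDS_LOAD_JOB, c.element());
        } else {
          c.output(c.element());
        }
      }
    }

It is applied with ParDo.of(new RouteByEventTime()).withOutputTags(STREAMABLE,
TupleTagList.of(NEEDS_LOAD_JOB)), and each branch then gets its own BigQueryIO
write.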

Do you also think this would be worth adding to BigQueryIO?
If so, I’ll try to create a PR for both features.

Thanks,
Wout

[1] : 
https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_partitioned_tables


From: Reuven Lax 
Reply-To: "dev@beam.apache.org" 
Date: Wednesday, 14 November 2018 at 14:51
To: "dev@beam.apache.org" 
Subject: Re: Bigquery streaming TableRow size limit

Generally I would agree, but the consequences here of a mistake are severe. Not 
only will the beam pipeline get stuck for 24 hours, _anything_ else in the 
user's GCP project that tries to load data into BigQuery will also fail for the 
next 24 hours. Given the severity, I think it's best to make the user opt into 
this behavior rather than do it magically.

On Wed, Nov 14, 2018 at 4:24 AM Lukasz Cwik <lc...@google.com> wrote:
I would rather not have the builder method and run into the quota issue than 
require the builder method and still run into quota issues.

On Mon, Nov 12, 2018 at 5:25 PM Reuven Lax <re...@google.com> wrote:
I'm a bit worried about making this automatic, as it can have unexpected side 
effects on BigQuery load-job quota. This is a 24-hour quota, so if it's 
accidentally exceeded, all load jobs for the project may be blocked for the next 
24 hours. However, if the user opts in (possibly via a builder method), this 
seems like it could be automatic.
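
To illustrate (a hypothetical method name, for illustration only; nothing like
this exists in BigQueryIO today):

    // Hypothetical opt-in builder method, not an existing API:
    BigQueryIO.writeTableRows()
        .to("project:dataset.table")
        .withLoadJobFallbackForOversizedRows();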

Reuven

On Tue, Nov 13, 2018 at 7:06 AM Lukasz Cwik <lc...@google.com> wrote:
Having data ingestion work without needing to worry about how big the blobs are 
would be nice if it was automatic for users.

On Mon, Nov 12, 2018 at 1:03 AM Wout Scheepers 
<wout.scheep...@vente-exclusive.com> wrote:
Hey all,

The TableRow size limit is 1 MB when streaming into BigQuery.
To prevent data loss, I’m going to implement a TableRow size check and add a 
fan-out to do a BigQuery load job in case the size is above the limit.
Of course this load job would be windowed.

I know it doesn’t make sense to stream data bigger than 1 MB, but as we’re using 
Pub/Sub and want to make sure no data loss happens whatsoever, I’ll need to 
implement it.

Is this functionality something any of you would like to see in BigQueryIO itself?
Or do you think my use case is too specific, and implementing my solution around 
BigQueryIO will suffice?

Thanks for your thoughts,
Wout




Bigquery streaming TableRow size limit

2018-11-12 Thread Wout Scheepers
Hey all,

The TableRow size limit is 1 MB when streaming into BigQuery.
To prevent data loss, I’m going to implement a TableRow size check and add a 
fan-out to do a BigQuery load job in case the size is above the limit.
Of course this load job would be windowed.

I know it doesn’t make sense to stream data bigger than 1 MB, but as we’re using 
Pub/Sub and want to make sure no data loss happens whatsoever, I’ll need to 
implement it.
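
Roughly what I have in mind (the threshold and tags are my own; the size is
measured on the row's JSON encoding, which is what a streaming insert sends):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.coders.CoderException;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
    import org.apache.beam.sdk.util.CoderUtils;
    import org.apache.beam.sdk.values.TupleTag;

    static final long MAX_STREAMING_BYTES = 1_000_000L; // ~1 MB per-row streaming limit
    static final TupleTag<TableRow> SMALL_ROWS = new TupleTag<TableRow>() {};
    static final TupleTag<TableRow> OVERSIZED_ROWS = new TupleTag<TableRow>() {};

    static class RouteBySize extends DoFn<TableRow, TableRow> {
      @ProcessElement
      public void processElement(ProcessContext c) throws CoderException {
        // Approximate the wire size via the row's JSON encoding.
        int size = CoderUtils.encodeToByteArray(TableRowJsonCoder.of(), c.element()).length;
        if (size > MAX_STREAMING_BYTES) {
          c.output(OVERSIZED_ROWS, c.element()); // fan out to the windowed load job
        } else {
          c.output(c.element()); // main output: streaming inserts as before
        }
      }
    }

The OVERSIZED_ROWS branch would then go through a second
BigQueryIO.writeTableRows() with Method.FILE_LOADS and a triggering frequency,
which is what makes the load job windowed.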

Is this functionality something any of you would like to see in BigQueryIO itself?
Or do you think my use case is too specific, and implementing my solution around 
BigQueryIO will suffice?

Thanks for your thoughts,
Wout




Running SpannerWriteIT on dataflow

2018-11-07 Thread Wout Scheepers
Hey all,

I’m still running into a bug when streaming into Spanner, which I describe in 
the comments of https://issues.apache.org/jira/browse/BEAM-4796.
I think the cause is a missing equals method on SpannerSchema, for which I get 
a warning in the worker logs when running on Dataflow.
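
For reference, the fix I have in mind is plain value equality, sketched below
(the field names are illustrative; whatever SpannerSchema actually holds would
be compared instead):

    // Sketch only: assumes SpannerSchema keeps per-table column and key-part
    // collections; the real field names in the class may differ.
    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof SpannerSchema)) {
        return false;
      }
      SpannerSchema that = (SpannerSchema) o;
      return columns.equals(that.columns) && keyParts.equals(that.keyParts);
    }

    @Override
    public int hashCode() {
      return java.util.Objects.hash(columns, keyParts);
    }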

To reproduce this, I would like to run the SpannerWriteIT integration test on 
Dataflow. Could anyone point me in the right direction on how to do this?

Thanks in advance
- Wout