This sounds a bit more specific, so I wouldn't add this to BigQueryIO yet.

On Thu, Nov 15, 2018 at 6:58 PM Wout Scheepers <[email protected]> wrote:
> Thanks for your thoughts.
>
> Also, I’m doing something similar when streaming data into partitioned tables.
>
> From [1]:
>
> “When the data is streamed, data between 7 days in the past and 3 days in the future is placed in the streaming buffer, and then it is extracted to the corresponding partitions.”
>
> I added a check to see if the event time is within this time bound. If not, a load job is triggered. This can happen when we replay old data.
>
> Do you also think this would be worth adding to BigqueryIO?
> If so, I’ll try to create a PR for both features.
>
> Thanks,
> Wout
>
> [1]: https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_partitioned_tables
>
> From: Reuven Lax <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, 14 November 2018 at 14:51
> To: "[email protected]" <[email protected]>
> Subject: Re: Bigquery streaming TableRow size limit
>
> Generally I would agree, but the consequences here of a mistake are severe. Not only will the Beam pipeline get stuck for 24 hours, _anything_ else in the user's GCP project that tries to load data into BigQuery will also fail for the next 24 hours. Given the severity, I think it's best to make the user opt into this behavior rather than do it magically.
>
> On Wed, Nov 14, 2018 at 4:24 AM Lukasz Cwik <[email protected]> wrote:
>
> I would rather not have the builder method and run into the quota issue than require the builder method and still run into quota issues.
>
> On Mon, Nov 12, 2018 at 5:25 PM Reuven Lax <[email protected]> wrote:
>
> I'm a bit worried about making this automatic, as it can have unexpected side effects on BigQuery load-job quota. This is a 24-hour quota, so if it's accidentally exceeded, all load jobs for the project may be blocked for the next 24 hours. However, if the user opts in (possibly via a builder method), this seems like it could be automatic.
>
> Reuven
>
> On Tue, Nov 13, 2018 at 7:06 AM Lukasz Cwik <[email protected]> wrote:
>
> Having data ingestion work without needing to worry about how big the blobs are would be nice if it was automatic for users.
>
> On Mon, Nov 12, 2018 at 1:03 AM Wout Scheepers <[email protected]> wrote:
>
> Hey all,
>
> The TableRow size limit is 1 MB when streaming into BigQuery.
> To prevent data loss, I’m going to implement a TableRow size check and add a fan-out to do a BigQuery load job in case the size is above the limit.
> Of course this load job would be windowed.
>
> I know it doesn’t make sense to stream data bigger than 1 MB, but as we’re using Pub/Sub and want to make sure no data loss happens whatsoever, I’ll need to implement it.
>
> Is this functionality something any of you would like to see in BigqueryIO itself?
> Or do you think my use case is too specific and implementing my solution around BigqueryIO will suffice?
>
> Thanks for your thoughts,
> Wout
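For readers of the archive, here is a minimal, hedged sketch of the fan-out Wout describes above, written against the Beam Java SDK: a DoFn estimates the JSON-encoded size of each TableRow and routes oversized rows to a side output that is written with windowed FILE_LOADS instead of STREAMING_INSERTS. The table name, size threshold, and triggering frequency are illustrative assumptions, not values from the thread, and this is not the behavior that ended up in BigQueryIO.

```java
// Sketch only: split rows by estimated size and write each branch with a
// different BigQueryIO method. Names and thresholds are illustrative.
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;
import org.joda.time.Duration;

public class SizeFanOutExample {

  // Illustrative threshold, kept below the 1 MB streaming-insert row limit.
  private static final int MAX_STREAMING_BYTES = 900 * 1024;

  private static final TupleTag<TableRow> SMALL_ROWS = new TupleTag<TableRow>() {};
  private static final TupleTag<TableRow> LARGE_ROWS = new TupleTag<TableRow>() {};

  static class SizeCheckFn extends DoFn<TableRow, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
      // Rough size estimate: JSON-encode the row, roughly what a streaming
      // insert would send over the wire.
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      TableRowJsonCoder.of().encode(c.element(), bytes);
      if (bytes.size() < MAX_STREAMING_BYTES) {
        c.output(c.element());               // main output: streaming path
      } else {
        c.output(LARGE_ROWS, c.element());   // side output: load-job path
      }
    }
  }

  static void write(PCollection<TableRow> rows) {
    PCollectionTuple split = rows.apply("SizeCheck",
        ParDo.of(new SizeCheckFn()).withOutputTags(SMALL_ROWS, TupleTagList.of(LARGE_ROWS)));

    // Normal path: rows small enough for streaming inserts.
    split.get(SMALL_ROWS)
        .setCoder(TableRowJsonCoder.of())
        .apply("StreamSmallRows",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table") // hypothetical table
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

    // Fallback path: oversized rows go through periodic, windowed load jobs.
    split.get(LARGE_ROWS)
        .setCoder(TableRowJsonCoder.of())
        .apply("LoadLargeRows",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table") // hypothetical table
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                .withTriggeringFrequency(Duration.standardMinutes(10))
                .withNumFileShards(1));
  }
}
```

As the thread notes, the load-job branch is exactly where the 24-hour load-job quota concern raised by Reuven comes in, which is why the discussion leans toward making such a fallback opt-in rather than automatic.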

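A similarly hedged sketch of the event-time check from Wout's 15 November mail: rows whose event time falls outside the streaming window that the linked documentation gives for partitioned tables (7 days in the past to 3 days in the future) are diverted to the load-job branch. It assumes element timestamps carry the event time; the tag name is illustrative, and the two write paths would be wired up the same way as in the previous sketch.

```java
// Sketch only: route rows to streaming or load jobs based on event time.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.TupleTag;
import org.joda.time.Duration;
import org.joda.time.Instant;

class PartitionTimeBoundFn extends DoFn<TableRow, TableRow> {
  // Streaming-buffer bounds for partitioned tables, per the BigQuery docs
  // quoted in the thread.
  private static final Duration MAX_PAST = Duration.standardDays(7);
  private static final Duration MAX_FUTURE = Duration.standardDays(3);

  // Hypothetical tag for rows that must go through a load job (e.g. replayed data).
  static final TupleTag<TableRow> OUT_OF_BOUNDS = new TupleTag<TableRow>() {};

  @ProcessElement
  public void processElement(ProcessContext c) {
    Instant now = Instant.now();
    Instant eventTime = c.timestamp(); // assumes element timestamps are event times
    boolean streamable =
        !eventTime.isBefore(now.minus(MAX_PAST)) && !eventTime.isAfter(now.plus(MAX_FUTURE));
    if (streamable) {
      c.output(c.element());                // main output: streaming inserts
    } else {
      c.output(OUT_OF_BOUNDS, c.element()); // side output: windowed load job
    }
  }
}
```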