This sounds a bit more specific, so I wouldn't add it to BigQueryIO yet.

On Thu, Nov 15, 2018 at 6:58 PM Wout Scheepers <
[email protected]> wrote:

> Thanks for your thoughts.
>
> Also, I’m doing something similar when streaming data into partitioned
> tables.
>
> From [1]:
>
> “ When the data is streamed, data between 7 days in the past and 3 days
> in the future is placed in the streaming buffer, and then it is extracted
> to the corresponding partitions.”
>
>
>
> I added a check to see whether the event time falls within this time bound.
> If not, a load job is triggered instead. This can happen when we replay old
> data.
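
(A minimal sketch of the kind of routing check described above, assuming
Beam's Java SDK; the DoFn and tag names are hypothetical, and the 7-day /
3-day bounds come from the documentation quoted earlier.)

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.TupleTag;
    import org.joda.time.Duration;
    import org.joda.time.Instant;

    class RouteByEventTime extends DoFn<TableRow, TableRow> {
      // Rows whose event time falls inside the streaming-buffer bounds.
      static final TupleTag<TableRow> STREAMABLE = new TupleTag<TableRow>() {};
      // Rows outside the bounds, which need a load job instead.
      static final TupleTag<TableRow> NEEDS_LOAD_JOB = new TupleTag<TableRow>() {};

      @ProcessElement
      public void processElement(ProcessContext c) {
        Instant now = Instant.now();
        Instant eventTime = c.timestamp();
        boolean withinStreamingBounds =
            !eventTime.isBefore(now.minus(Duration.standardDays(7)))
                && !eventTime.isAfter(now.plus(Duration.standardDays(3)));
        c.output(withinStreamingBounds ? STREAMABLE : NEEDS_LOAD_JOB, c.element());
      }
    }

(The two outputs would be obtained with ParDo.of(new RouteByEventTime())
.withOutputTags(STREAMABLE, TupleTagList.of(NEEDS_LOAD_JOB)) and wired to the
streaming write and a windowed load-job write respectively.)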
>
>
>
> Do you also think this would be worth adding to BigQueryIO?
>
> If so, I’ll try to create a PR for both features.
>
>
>
> Thanks,
>
> Wout
>
>
>
> [1] :
> https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_partitioned_tables
>
>
>
>
>
> *From: *Reuven Lax <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Wednesday, 14 November 2018 at 14:51
> *To: *"[email protected]" <[email protected]>
> *Subject: *Re: Bigquery streaming TableRow size limit
>
>
>
> Generally I would agree, but the consequences of a mistake here are
> severe. Not only will the Beam pipeline get stuck for 24 hours, but
> _anything_ else in the user's GCP project that tries to load data into
> BigQuery will also fail for the next 24 hours. Given the severity, I think
> it's best to make the user opt into this behavior rather than do it
> magically.
>
>
>
> On Wed, Nov 14, 2018 at 4:24 AM Lukasz Cwik <[email protected]> wrote:
>
> I would rather not have the builder method and run into the quota issue
> than require the builder method and still run into quota issues.
>
>
>
> On Mon, Nov 12, 2018 at 5:25 PM Reuven Lax <[email protected]> wrote:
>
> I'm a bit worried about making this automatic, as it can have unexpected
> side effects on the BigQuery load-job quota. This is a 24-hour quota, so if
> it's accidentally exceeded, all load jobs for the project may be blocked
> for the next 24 hours. However, if the user opts in (possibly via a builder
> method), this seems like it could be automatic.
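
(For illustration only, one shape such an opt-in could take on
BigQueryIO.Write, assuming rows is a PCollection<TableRow>. The
withOversizedRowFallbackToLoadJobs() name is invented for this sketch and
does not exist in Beam; writeTableRows(), to(), and withMethod() are the
existing builder calls.)

    rows.apply(
        "WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            // Hypothetical opt-in: fall back to batch load jobs for rows
            // the streaming path cannot accept.
            .withOversizedRowFallbackToLoadJobs());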
>
>
>
> Reuven
>
>
>
> On Tue, Nov 13, 2018 at 7:06 AM Lukasz Cwik <[email protected]> wrote:
>
> Having data ingestion work without needing to worry about how big the
> blobs are would be nice if it were automatic for users.
>
>
>
> On Mon, Nov 12, 2018 at 1:03 AM Wout Scheepers <
> [email protected]> wrote:
>
> Hey all,
>
>
>
> The TableRow size limit is 1 MB when streaming into BigQuery.
>
> To prevent data loss, I’m going to implement a TableRow size check and add
> a fan-out to a BigQuery load job in case the size is above the limit.
>
> Of course this load job would be windowed.
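
(A minimal sketch of such a size check, assuming Beam's Java SDK. The
threshold, DoFn, and tag names are illustrative, and measuring the row via
TableRowJsonCoder only approximates what the streaming API counts.)

    import com.google.api.services.bigquery.model.TableRow;
    import java.io.ByteArrayOutputStream;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.TupleTag;

    class SplitBySize extends DoFn<TableRow, TableRow> {
      // Illustrative threshold: the documented per-row streaming limit is
      // 1 MB; a real check would likely leave headroom for request overhead.
      private static final long MAX_STREAMING_BYTES = 1_000_000L;

      static final TupleTag<TableRow> STREAMABLE = new TupleTag<TableRow>() {};
      static final TupleTag<TableRow> OVERSIZED = new TupleTag<TableRow>() {};

      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        TableRowJsonCoder.of().encode(c.element(), bytes);
        c.output(bytes.size() <= MAX_STREAMING_BYTES ? STREAMABLE : OVERSIZED,
            c.element());
      }
    }

(The STREAMABLE output would keep going through the streaming insert path,
while the OVERSIZED output could feed a separate windowed BigQueryIO.write()
using Method.FILE_LOADS with a triggering frequency.)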
>
>
>
> I know it doesn’t make sense to stream data bigger than 1 MB, but as we’re
> using Pub/Sub and want to make sure no data loss happens whatsoever, I’ll
> need to implement it.
>
>
>
> Is this functionality something any of you would like to see in BigQueryIO
> itself?
>
> Or do you think my use case is too specific and that implementing my
> solution around BigQueryIO will suffice?
>
>
>
> Thanks for your thoughts,
>
> Wout
>
