Re: [PROPOSAL] decrease the number of threads for BigQuery streaming insertAll

2019-01-17 Thread Chamikara Jayalath
Thanks Heejong.

I agree that writing to a service from 50 unlimited thread pools sounds
excessive and can flood that service (BigQuery in this case) in error
scenarios where we should back off. Limiting parallelism to a suitable,
fixed amount (50 in this case) sounds good to me.

Thanks,
Cham

On Wed, Jan 16, 2019 at 6:53 PM Heejong Lee wrote:

> Hi,
>
> I'd like to propose a change [1] to the thread pool type used by BigQuery
> streaming inserts in the Java SDK (BEAM-6443). When we insert small
> amounts of data into BigQuery very quickly using BigQueryIO.write, it
> writes lots of "rate limit exceeded" errors to the log file. The main
> cause is that far too many threads are used for the insert job (50 shards
> * dozens of futures per bundle, executed by an unlimited thread pool). I
> ran some benchmarks [2] and found that switching from an unlimited thread
> pool to a single thread pool reduces the number of (repeated, possibly
> meaningless) error messages by 1/4 while retaining the same performance.
> I don't expect this change to hurt any important performance measure, but
> if anybody has concerns about it, please let me know.
>
> Thanks,
>
> [1] https://github.com/apache/beam/pull/7547
> [2]
> https://docs.google.com/document/d/1EhRNWLevm86GD_QtvlrTauHITVMwQBzuemyp-w4Z_ck/edit#heading=h.c0angyd9tn21
>


[PROPOSAL] decrease the number of threads for BigQuery streaming insertAll

2019-01-16 Thread Heejong Lee
Hi,

I'd like to propose a change [1] to the thread pool type used by BigQuery
streaming inserts in the Java SDK (BEAM-6443). When we insert small
amounts of data into BigQuery very quickly using BigQueryIO.write, it
writes lots of "rate limit exceeded" errors to the log file. The main
cause is that far too many threads are used for the insert job (50 shards
* dozens of futures per bundle, executed by an unlimited thread pool). I
ran some benchmarks [2] and found that switching from an unlimited thread
pool to a single thread pool reduces the number of (repeated, possibly
meaningless) error messages by 1/4 while retaining the same performance.
I don't expect this change to hurt any important performance measure, but
if anybody has concerns about it, please let me know.

Thanks,

[1] https://github.com/apache/beam/pull/7547
[2]
https://docs.google.com/document/d/1EhRNWLevm86GD_QtvlrTauHITVMwQBzuemyp-w4Z_ck/edit#heading=h.c0angyd9tn21
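
For readers unfamiliar with the difference between the pool types discussed
in this thread, here is a minimal sketch. It is not the Beam code from the
pull request. It assumes that "unlimited thread pool" corresponds to
something like Executors.newCachedThreadPool() and "single thread pool" to
Executors.newSingleThreadExecutor(). The class PoolComparison, the method
peakConcurrency, and the 20 ms sleep standing in for an insertAll call are
all hypothetical names and values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class PoolComparison {

  // Submits `tasks` simulated insert calls to the executor and returns the
  // highest number of calls that were in flight at the same time.
  static int peakConcurrency(ExecutorService executor, int tasks) {
    AtomicInteger inFlight = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    List<Future<?>> futures = new ArrayList<>();
    for (int i = 0; i < tasks; i++) {
      futures.add(executor.submit(() -> {
        int now = inFlight.incrementAndGet();
        peak.accumulateAndGet(now, Math::max);
        try {
          Thread.sleep(20); // hypothetical stand-in for one remote insert call
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
        inFlight.decrementAndGet();
      }));
    }
    try {
      for (Future<?> f : futures) {
        f.get(); // wait for every simulated call to finish
      }
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    } finally {
      executor.shutdown();
    }
    return peak.get();
  }

  public static void main(String[] args) {
    // A cached pool creates a new thread whenever none is idle, so many
    // calls can hit the remote service at once.
    System.out.println("cached pool peak: "
        + peakConcurrency(Executors.newCachedThreadPool(), 30));
    // A single-thread executor runs the same calls strictly one at a time.
    System.out.println("single thread peak: "
        + peakConcurrency(Executors.newSingleThreadExecutor(), 30));
  }
}
```

With a cached pool the peak depends on timing and can approach the number of
submitted tasks, while a single-thread executor never runs more than one
call at once. Multiplied across 50 shards, that gap is the kind of fan-out
the proposal above aims to reduce.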