Thanks Heejong.
I agree that writing to a service from 50 unbounded thread pools sounds
excessive and can flood that service (BigQuery in this case) in error
scenarios where we should back off. Settling on a suitable, bounded
degree of parallelism (50 in this case) sounds good to me.
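
To make the difference concrete, here is a minimal sketch. It is not the
BigQueryIO code: the class name, the method name, and the 20 ms sleep that
stands in for an insert request are all made up for illustration. It only
measures how many simulated requests are in flight at once for an unbounded
(cached) pool versus bounded ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of Beam.
public class PoolComparison {

  // Runs `tasks` simulated insert calls on `pool` and returns the highest
  // number of calls that were in flight at the same time.
  public static int peakConcurrency(ExecutorService pool, int tasks) throws Exception {
    AtomicInteger inFlight = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    List<Future<?>> futures = new ArrayList<>();
    for (int i = 0; i < tasks; i++) {
      futures.add(pool.submit(() -> {
        int now = inFlight.incrementAndGet();
        peak.accumulateAndGet(now, Math::max);
        try {
          Thread.sleep(20); // assumed stand-in for one insert request
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          inFlight.decrementAndGet();
        }
      }));
    }
    for (Future<?> f : futures) {
      f.get(); // wait for every task to finish
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.SECONDS);
    return peak.get();
  }

  public static void main(String[] args) throws Exception {
    int tasks = 30;
    System.out.println("cached (unbounded): "
        + peakConcurrency(Executors.newCachedThreadPool(), tasks));
    System.out.println("fixed(4):           "
        + peakConcurrency(Executors.newFixedThreadPool(4), tasks));
    System.out.println("single thread:      "
        + peakConcurrency(Executors.newSingleThreadExecutor(), tasks));
  }
}
```

The bounded pools cap the number of in-flight requests at the pool size,
while the cached pool starts a new thread whenever no idle one is available,
so its peak typically grows with the number of submitted tasks.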
Thanks,
Cham
On Wed, Jan 16, 2019 at 6:53 PM Heejong Lee wrote:
> Hi,
>
> I'd like to propose changing the thread pool type[1] used for BigQuery
> streaming inserts in the Java SDK (BEAM-6443). When we insert small amounts
> of data into BigQuery very quickly using BigQueryIO.write, it generates lots
> of "rate limit exceeded" errors in the log. This is mainly because the
> number of threads used for the inserts is just too large (50 shards *
> dozens of futures executed by an unlimited thread pool per bundle). I ran
> some benchmarks[2] and saw that changing from an unlimited thread pool to a
> single thread pool reduces the number of (repeated, possibly meaningless)
> error messages by 1/4 while retaining the same performance. I don't think
> this change will hurt any important performance measure, but if anybody has
> concerns about it, please let me know.
>
> Thanks,
>
> [1] https://github.com/apache/beam/pull/7547
> [2]
> https://docs.google.com/document/d/1EhRNWLevm86GD_QtvlrTauHITVMwQBzuemyp-w4Z_ck/edit#heading=h.c0angyd9tn21
>