[ 
https://issues.apache.org/jira/browse/BEAM-6443?focusedWorklogId=189207&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-189207
 ]

ASF GitHub Bot logged work on BEAM-6443:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 23/Jan/19 22:11
            Start Date: 23/Jan/19 22:11
    Worklog Time Spent: 10m 
      Work Description: reuvenlax commented on issue #7547: [BEAM-6443] 
decrease the number of thread for BigQuery streaming inse…
URL: https://github.com/apache/beam/pull/7547#issuecomment-456986573
 
 
   I'm still worried about this.  A few concrete worries:
     * If latency on BigQuery inserts increases (which happens from time to 
time), today we will naturally start using more threads and still keep up. This 
change will break that, as everything will bottleneck behind a single thread. 
Testing this scenario is tricky (as BigQuery _usually_ has low latency on 
inserts).
   
    * The benchmark tests one very limited case. I would be far more worried 
about the case in which inserts are going to thousands of tables in parallel 
using DynamicDestinations; there are users who do such things, and I'm worried 
that this change will cause those writes to all be serialized on a single 
thread.
   
   * Another of my worries is that switching to a single thread will cap the 
max throughput too low. I'm not sure that this benchmark approaches the 
max throughput.
   
   Essentially, switching to a single thread risks forcing the pipeline to 
become IO bound, which is a bad place for a streaming pipeline. We need a 
better way to prevent quota-exceeded errors; however, empirically most users 
are not hitting this problem, so whatever we do shouldn't hurt existing users 
(remember: we tend to be biased toward noticing failure conditions, since 
those are the users who report). Is there a way to improve our throttling 
without limiting things to a single thread?
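
   One possible shape for such throttling, as a minimal sketch only: cap the 
number of in-flight insert calls with a semaphore while still letting the 
thread pool grow when individual calls are slow. The class name 
BoundedInsertSketch, the maxInFlight parameter, and the Callable standing in 
for an insertAll request are all hypothetical; none of this is BigQueryIO 
code, and the right cap would have to be found by measurement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of Beam): bounds concurrent insert calls. */
public class BoundedInsertSketch {
  private final ExecutorService pool = Executors.newCachedThreadPool();
  private final Semaphore permits;
  private final AtomicInteger inFlight = new AtomicInteger();
  private final AtomicInteger peakInFlight = new AtomicInteger();

  public BoundedInsertSketch(int maxInFlight) {
    this.permits = new Semaphore(maxInFlight);
  }

  /** Blocks the caller once maxInFlight calls are running (backpressure). */
  public <T> Future<T> submit(Callable<T> insertCall) throws InterruptedException {
    permits.acquire();
    try {
      return pool.submit(() -> {
        // Record the highest concurrency observed, for inspection only.
        int now = inFlight.incrementAndGet();
        peakInFlight.accumulateAndGet(now, Math::max);
        try {
          return insertCall.call();
        } finally {
          inFlight.decrementAndGet();
          permits.release();
        }
      });
    } catch (RuntimeException e) {
      // The task never started, so give the permit back.
      permits.release();
      throw e;
    }
  }

  public int peakInFlight() {
    return peakInFlight.get();
  }

  public void shutdown() {
    pool.shutdown();
  }

  public static void main(String[] args) throws Exception {
    BoundedInsertSketch sketch = new BoundedInsertSketch(4);
    List<Future<Integer>> results = new ArrayList<>();
    for (int i = 0; i < 20; i++) {
      final int batch = i;
      // Thread.sleep stands in for a slow insert request.
      results.add(sketch.submit(() -> {
        Thread.sleep(20);
        return batch;
      }));
    }
    for (Future<Integer> f : results) {
      f.get();
    }
    sketch.shutdown();
    System.out.println("peak in-flight calls: " + sketch.peakInFlight());
  }
}
```

   With a cap larger than one, a latency spike uses more threads up to the 
cap instead of queuing everything behind a single thread, while the cap 
still limits how fast requests can be issued.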
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 189207)
    Time Spent: 1h 10m  (was: 1h)

> decrease the number of threads for BigQuery streaming insertAll
> ---------------------------------------------------------------
>
>                 Key: BEAM-6443
>                 URL: https://issues.apache.org/jira/browse/BEAM-6443
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-gcp
>            Reporter: Heejong Lee
>            Assignee: Heejong Lee
>            Priority: Major
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When inserting a large number of very small elements into BigQuery via 
> streaming insertAll, BigQueryIO causes many quota-exceeded errors. This 
> implies that 1) BigQueryIO puts unnecessary overhead on the BigQuery API 
> layer by sending requests too fast, and 2) the log file becomes very large 
> because the same error message is repeated. Currently we use 50 shards for 
> writing data into BigQuery, and in each bundle 20-30 futures are executed 
> simultaneously with an unlimited thread pool. It would be worth investigating 
> whether a single-thread pool is sufficient for running concurrent insertAll.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
