Chamikara Jayalath created BEAM-5105:
----------------------------------------

             Summary: Move load job poll to finishBundle() method to better 
parallelize execution
                 Key: BEAM-5105
                 URL: https://issues.apache.org/jira/browse/BEAM-5105
             Project: Beam
          Issue Type: Improvement
          Components: io-java-gcp
            Reporter: Chamikara Jayalath


It appears that when we write to BigQuery using WriteTablesDoFn we start a load 
job and wait for that job to finish.

[https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L318]

 

In cases where we are trying to write a PCollection of tables (for example, 
when users use the dynamic destinations feature), this relies on dynamic work 
rebalancing to parallelize execution of load jobs. If the runner does not 
support dynamic work rebalancing, or does not perform it for some reason, 
this could have significant performance drawbacks. For example, scheduling 
times for load jobs will add up, because each job is started only after the 
previous one in the same bundle has finished.

 

A better approach might be to start load jobs in the process() method but wait 
for all load jobs to finish in the finishBundle() method. This will parallelize 
any overheads as well as job execution (assuming more than one job is scheduled 
by BQ at a time).
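
A rough sketch of the proposed shape, in plain Java with no Beam dependency. 
This is not the WriteTables code: LoadJobService, DeferredLoadJobSketch and the 
"DONE:" result strings are hypothetical names made up for illustration, and the 
startBundle()/processElement()/finishBundle() methods only mimic the DoFn 
lifecycle. The point is just that processElement() records a handle for each 
started job and returns, and finishBundle() is the only place that blocks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class DeferredLoadJobSketch {

    /** Hypothetical stand-in for a BigQuery job service; NOT a Beam API. */
    public interface LoadJobService {
        /** Starts a load job for the table and returns a handle immediately. */
        CompletableFuture<String> startLoadJob(String table);
    }

    private final LoadJobService jobService;
    // Handles of jobs started in the current bundle and not yet waited on.
    private final List<CompletableFuture<String>> pendingJobs = new ArrayList<>();

    public DeferredLoadJobSketch(LoadJobService jobService) {
        this.jobService = jobService;
    }

    /** Mimics startBundle(): begin each bundle with no pending jobs. */
    public void startBundle() {
        pendingJobs.clear();
    }

    /** Mimics process(): start the job but do not wait for it. */
    public void processElement(String table) {
        pendingJobs.add(jobService.startLoadJob(table));
    }

    /** Mimics finishBundle(): block until every started job has finished. */
    public List<String> finishBundle() {
        List<String> results = new ArrayList<>();
        for (CompletableFuture<String> job : pendingJobs) {
            results.add(job.join()); // rethrows if the job failed
        }
        pendingJobs.clear();
        return results;
    }

    public static void main(String[] args) {
        // Fake service (assumption): each "job" sleeps briefly, then reports done.
        LoadJobService fake = table -> CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "DONE:" + table;
        });

        DeferredLoadJobSketch fn = new DeferredLoadJobSketch(fake);
        fn.startBundle();
        fn.processElement("table_a");
        fn.processElement("table_b");
        fn.processElement("table_c");
        System.out.println(fn.finishBundle());
    }
}
```

Results come back in the order the jobs were started, since finishBundle() 
walks the pending list in order. A real change would also need to keep the 
existing retry and failure handling, which this sketch leaves out.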

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
