Chamikara Jayalath created BEAM-5105:
----------------------------------------
Summary: Move load job poll to finishBundle() method to better
parallelize execution
Key: BEAM-5105
URL: https://issues.apache.org/jira/browse/BEAM-5105
Project: Beam
Issue Type: Improvement
Components: io-java-gcp
Reporter: Chamikara Jayalath
It appears that when we write to BigQuery using WriteTablesDoFn we start a load
job and wait for that job to finish.
[https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L318]
In cases where we are trying to write a PCollection of tables (for example,
when a user uses the dynamic destinations feature), this relies on dynamic work
rebalancing to parallelize execution of load jobs. If the runner does not
support dynamic work rebalancing, or does not perform it for some reason, this
could have significant performance drawbacks. For example, scheduling times for
load jobs will add up, since the jobs in a bundle run one after another.
A better approach might be to start load jobs in the process() method but wait
for all load jobs to finish in the finishBundle() method. This will parallelize
any overheads as well as job execution (assuming more than one job is scheduled
by BQ at a time).
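
A rough sketch of the proposed shape, in plain Java so that it runs without
Beam on the classpath. It is not the WriteTables code. The names
DeferredPollSketch and LoadJobClient are hypothetical, the three methods only
mimic where @StartBundle, @ProcessElement and @FinishBundle would sit in a
DoFn, and the load job is faked with a sleeping task. Error handling and
retries are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for a DoFn that starts load jobs and waits for them later. */
class DeferredPollSketch {

  /** Hypothetical job client: start() returns at once; the future completes when the job ends. */
  interface LoadJobClient {
    CompletableFuture<String> start(String table);
  }

  private final LoadJobClient client;
  // Jobs started in the current bundle that have not been awaited yet.
  private final List<CompletableFuture<String>> pending = new ArrayList<>();

  DeferredPollSketch(LoadJobClient client) {
    this.client = client;
  }

  // Where @StartBundle would go: reset per-bundle state.
  void startBundle() {
    pending.clear();
  }

  // Where @ProcessElement would go: start the job, do not wait for it.
  void processElement(String table) {
    pending.add(client.start(table));
  }

  // Where @FinishBundle would go: wait for every job started in this bundle.
  List<String> finishBundle() {
    List<String> results = new ArrayList<>();
    for (CompletableFuture<String> job : pending) {
      results.add(job.join());
    }
    pending.clear();
    return results;
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    // Fake load job: sleeps 200 ms to imitate scheduling and execution time.
    LoadJobClient fake = table -> CompletableFuture.supplyAsync(() -> {
      try {
        Thread.sleep(200);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      return table + ":DONE";
    }, pool);

    DeferredPollSketch fn = new DeferredPollSketch(fake);
    long start = System.nanoTime();
    fn.startBundle();
    for (String table : new String[] {"t1", "t2", "t3", "t4"}) {
      fn.processElement(table);
    }
    List<String> results = fn.finishBundle();
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;

    // With start-then-wait per element, the four sleeps would run back to back.
    System.out.println(results + " in " + elapsedMs + " ms");
    pool.shutdown();
  }
}
```

In this sketch the waits overlap because every job is already running by the
time finishBundle() blocks on the first one. Whether real load jobs overlap in
the same way depends on how BQ schedules them, as noted above.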