[ https://issues.apache.org/jira/browse/BEAM-5105?focusedWorklogId=150532&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-150532 ]
ASF GitHub Bot logged work on BEAM-5105:
----------------------------------------
Author: ASF GitHub Bot
Created on: 02/Oct/18 21:26
Start Date: 02/Oct/18 21:26
Worklog Time Spent: 10m
Work Description: reuvenlax commented on issue #6416: [BEAM-5105] Better
parallelize BigQuery load jobs
URL: https://github.com/apache/beam/pull/6416#issuecomment-426436189
@aaltay most comments addressed. I just want to write some more unit tests
before submitting.
@chamikaramj Do we have any large-scale BQIO tests we can run against actual
BQ?
----------------------------------------------------------------
Issue Time Tracking
-------------------
Worklog Id: (was: 150532)
Time Spent: 1h 10m (was: 1h)
> Move load job poll to finishBundle() method to better parallelize execution
> ---------------------------------------------------------------------------
>
> Key: BEAM-5105
> URL: https://issues.apache.org/jira/browse/BEAM-5105
> Project: Beam
> Issue Type: Improvement
> Components: io-java-gcp
> Reporter: Chamikara Jayalath
> Priority: Major
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> It appears that when we write to BigQuery using WriteTablesDoFn we start a
> load job and wait for that job to finish.
> [https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L318]
>
> In cases where we are trying to write a PCollection of tables (for example,
> when a user uses the dynamic destinations feature), this relies on dynamic work
> rebalancing to parallelize execution of load jobs. If the runner does not
> support dynamic work rebalancing, or does not perform it for some reason,
> this could have significant performance drawbacks. For example, scheduling
> times for load jobs will add up.
>
> A better approach might be to start load jobs in the process() method but wait
> for all load jobs to finish in the finishBundle() method. This will parallelize
> any overheads as well as job execution (assuming more than one job is
> scheduled by BQ).
>
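The approach proposed in the description above can be sketched without any Beam or BigQuery dependency. This is only an illustration of the start-then-wait pattern, not the code in WriteTables.java or in the pull request: LoadJobClient, BundledLoadWriter, startLoadJob and runBundle are hypothetical names, and the "load job" is simulated with a sleep on a thread pool. The writer mirrors the DoFn lifecycle by starting a job per element and only blocking once, at the end of the bundle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BundledLoadJobsSketch {

  /** Hypothetical stand-in for a job service; not a Beam or BigQuery API. */
  interface LoadJobClient {
    /** Starts a load job for the table and returns a handle without waiting. */
    CompletableFuture<String> startLoadJob(String tableSpec);
  }

  /** Mirrors the DoFn lifecycle: startBundle, processElement per input, finishBundle. */
  static class BundledLoadWriter {
    private final LoadJobClient client;
    private final List<CompletableFuture<String>> pending = new ArrayList<>();

    BundledLoadWriter(LoadJobClient client) {
      this.client = client;
    }

    void startBundle() {
      pending.clear();
    }

    /** Starts the job and remembers the handle, but does not wait for it. */
    void processElement(String tableSpec) {
      pending.add(client.startLoadJob(tableSpec));
    }

    /** Waits for every job started in this bundle; results keep input order. */
    List<String> finishBundle() {
      List<String> finished = new ArrayList<>();
      for (CompletableFuture<String> job : pending) {
        // join() rethrows a failed job as CompletionException.
        finished.add(job.join());
      }
      pending.clear();
      return finished;
    }
  }

  /** Runs one simulated bundle in which each job takes about jobMillis. */
  static List<String> runBundle(List<String> tables, long jobMillis) {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tables.size()));
    try {
      // Simulated client: each "load job" is a sleep on a pool thread.
      LoadJobClient client = table -> CompletableFuture.supplyAsync(() -> {
        try {
          TimeUnit.MILLISECONDS.sleep(jobMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException(e);
        }
        return "loaded:" + table;
      }, pool);

      BundledLoadWriter writer = new BundledLoadWriter(client);
      writer.startBundle();
      for (String table : tables) {
        writer.processElement(table);
      }
      return writer.finishBundle();
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    List<String> done = runBundle(Arrays.asList("dataset.a", "dataset.b", "dataset.c"), 200);
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    System.out.println(done + " finished in about " + elapsedMs + " ms");
  }
}
```

Because the three simulated jobs overlap, the bundle should take roughly as long as one job rather than the sum of all three, which is the effect the issue is after. In a real DoFn the wait in finishBundle() would also need to handle job failures and retries, which this sketch leaves out.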