[
https://issues.apache.org/jira/browse/BEAM-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vincent Spiewak updated BEAM-2840:
----------------------------------
Description:
BigQueryIO Writer is slow / fails if the input source is bounded.
If the input source is bounded (GCS / BQ select / ...), BigQueryIO Writer uses
the
"[Method.FILE_LOADS|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1168]"
instead of streaming inserts.
Large amounts of input data result in a java.lang.OutOfMemoryError / Java
heap space (500 million rows).
!PrepareWrite.BatchLoads.png|thumbnail!
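For reference, a minimal sketch of the kind of pipeline that hits this path (all bucket / project / dataset / table names below are placeholders, not taken from this issue):
{code:java}
// Minimal sketch: a bounded GCS read written to BigQuery.
// All resource names here are placeholders.
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class BoundedBqWrite {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    TableSchema schema = new TableSchema().setFields(Collections.singletonList(
        new TableFieldSchema().setName("line").setType("STRING")));

    p.apply("ReadGcs", TextIO.read().from("gs://my-bucket/input/*")) // bounded source
        .apply("ToTableRow", MapElements
            .into(TypeDescriptor.of(TableRow.class))
            .via((String line) -> new TableRow().set("line", line)))
        .setCoder(TableRowJsonCoder.of())
        // No method is set, so the bounded input goes down the
        // Method.FILE_LOADS path described above.
        .apply("WriteBq", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withSchema(schema));

    p.run();
  }
}
{code}
Since no method is set, the bounded TextIO input sends BigQueryIO down the FILE_LOADS path, which is where the OOM shows up at this scale.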
We cannot use "Method.STREAMING_INSERTS" or control the batch sizes since
[withMaxFilesPerBundle|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1131]
is private :( (see the sketch just below)
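What we would want is something like the following, reusing the schema and table placeholders from the sketch above (withMethod is part of the public BigQueryIO.Write API; whether it can actually take effect for a bounded input on 2.0.0 is part of this issue, and withMaxFilesPerBundle cannot be called at all since it is not public):
{code:java}
// The write step we would need: insert method and batching under our
// control. Sketch only, per the limitations described above.
BigQueryIO.writeTableRows()
    .to("my-project:my_dataset.my_table")
    .withSchema(schema)
    // Force streaming inserts instead of the bounded-input FILE_LOADS default.
    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS);
{code}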
Someone reported a similar problem with GCS -> BQ on Stack Overflow:
[Why is writing to BigQuery from a Dataflow/Beam pipeline
slow?|https://stackoverflow.com/questions/45889992/why-is-writing-to-bigquery-from-a-dataflow-beam-pipeline-slow#comment78954153_45889992]
was:
BigQueryIO Writer is slow / fails if the input source is bounded.
If the input source is bounded (GCS / BQ select / ...), BigQueryIO Writer uses
the
"[Method.FILE_LOADS|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1168]"
instead of streaming inserts.
Large amounts of input data result in a java.lang.OutOfMemoryError / Java
heap space (500 million rows).
We cannot use "Method.STREAMING_INSERTS" or control the batch sizes since
[withMaxFilesPerBundle|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1131]
is private :(
Someone reported a similar problem with GCS -> BQ on Stack Overflow:
[Why is writing to BigQuery from a Dataflow/Beam pipeline
slow?|https://stackoverflow.com/questions/45889992/why-is-writing-to-bigquery-from-a-dataflow-beam-pipeline-slow#comment78954153_45889992]
> BigQueryIO write is slow/fails with a bounded source
> ----------------------------------------------------
>
> Key: BEAM-2840
> URL: https://issues.apache.org/jira/browse/BEAM-2840
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-gcp
> Affects Versions: 2.0.0
> Environment: Google Cloud Platform
> Reporter: Vincent Spiewak
> Assignee: Chamikara Jayalath
> Attachments: PrepareWrite.BatchLoads.png
>
>
> BigQueryIO Writer is slow / fails if the input source is bounded.
> If the input source is bounded (GCS / BQ select / ...), BigQueryIO Writer uses
> the
> "[Method.FILE_LOADS|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1168]"
> instead of streaming inserts.
> Large amounts of input data result in a java.lang.OutOfMemoryError / Java
> heap space (500 million rows).
> !PrepareWrite.BatchLoads.png|thumbnail!
> We cannot use "Method.STREAMING_INSERTS" or control the batch sizes since
> [withMaxFilesPerBundle|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1131]
> is private :(
> Someone reported a similar problem with GCS -> BQ on Stack Overflow:
> [Why is writing to BigQuery from a Dataflow/Beam pipeline
> slow?|https://stackoverflow.com/questions/45889992/why-is-writing-to-bigquery-from-a-dataflow-beam-pipeline-slow#comment78954153_45889992]
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)