Vincent Spiewak created BEAM-2840:
-------------------------------------

             Summary: BigQueryIO write is slow / fails with a bounded source
                 Key: BEAM-2840
                 URL: https://issues.apache.org/jira/browse/BEAM-2840
             Project: Beam
          Issue Type: Bug
          Components: sdk-java-gcp
    Affects Versions: 2.0.0
         Environment: Google Cloud Platform
            Reporter: Vincent Spiewak
            Assignee: Chamikara Jayalath
         Attachments: Capture d’écran 2017-09-05 à 11.15.40.png

BigQueryIO Writer is slow or fails if the input source is bounded.

If the input source is bounded (GCS, a BQ select, ...), BigQueryIO Writer uses
[Method.FILE_LOADS|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1168]
instead of streaming inserts.

Large amounts of input data (500 million rows) result in a
java.lang.OutOfMemoryError: Java heap space.

We cannot use "Method.STREAMING_INSERTS" or control the batch sizes since
[withMaxFilesPerBundle|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1131]
is private :(
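
For illustration (this sketch is not from the original report), a minimal bounded GCS -> BigQuery pipeline of the kind described would look like the following. The bucket, project, and table names are placeholders, and ParseLineFn stands in for the user's actual line-to-TableRow parsing:

```java
// Hypothetical sketch of a bounded-input BigQuery write. Placeholder names:
// "my-bucket", "my-project:my_dataset.my_table", ParseLineFn.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;

public class BoundedBqWrite {
  // Placeholder parser: real code would map each input line to a TableRow
  // matching the destination schema.
  static class ParseLineFn extends SimpleFunction<String, TableRow> {
    @Override
    public TableRow apply(String line) {
      return new TableRow().set("raw", line);
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(TextIO.read().from("gs://my-bucket/input/*.json")) // bounded source
     .apply(MapElements.via(new ParseLineFn()))
     .apply(BigQueryIO.writeTableRows()
         .to("my-project:my_dataset.my_table")
         // Because the input is bounded, the sink selects Method.FILE_LOADS
         // internally; in 2.0.0 there is no public setter on the write
         // transform to force streaming inserts or cap files per bundle.
         .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
         .withWriteDisposition(WriteDisposition.WRITE_APPEND));
    p.run();
  }
}
```

Since the method selection happens inside the transform, the only workaround from user code would be such a public setter, which is what this issue asks for.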

Someone reported a similar problem with GCS -> BQ on Stack Overflow:
[Why is writing to BigQuery from a Dataflow/Beam pipeline 
slow?|https://stackoverflow.com/questions/45889992/why-is-writing-to-bigquery-from-a-dataflow-beam-pipeline-slow#comment78954153_45889992]





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
