Morten Andersen created BEAM-7094:
-------------------------------------

             Summary: DataflowRunner does not scale when reading gzip file
                 Key: BEAM-7094
                 URL: https://issues.apache.org/jira/browse/BEAM-7094
             Project: Beam
          Issue Type: Bug
          Components: runner-dataflow, sdk-py-core
    Affects Versions: 2.11.0
         Environment: Python on Dataflow
            Reporter: Morten Andersen


Hi,

I have a pipeline that uses ReadFromText() to read a 700 MB gzip file from a GCS bucket.

It then parses the JSON, creates BigQuery rows, and writes them with WriteToBigQuery.
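
For reference, here is a minimal sketch of a pipeline with this shape. It is not the actual job: the field names ("id", "payload"), the schema string, and the run() arguments are placeholders assumed for illustration. ReadFromText() is left at its default compression handling, which picks gzip from the .gz extension.

```python
import json


def to_bq_row(line):
    """Parse one JSON line into a dict usable as a BigQuery row."""
    record = json.loads(line)
    # Hypothetical fields; the real schema is not shown in this report.
    return {"id": record["id"], "payload": json.dumps(record, sort_keys=True)}


def run(input_path, output_table, pipeline_args=None):
    # Imported here so to_bq_row() can be exercised without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        (
            p
            # input_path is a placeholder, e.g. a gs:// path ending in .gz
            | "Read" >> beam.io.ReadFromText(input_path)
            | "ToRow" >> beam.Map(to_bq_row)
            | "Write" >> beam.io.WriteToBigQuery(
                output_table,
                schema="id:STRING,payload:STRING",  # assumed schema
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

Only the input file differs between the two runs compared in this report; the rest of the pipeline is unchanged.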

The pipeline above does not scale. If I specify 2 workers on startup, Dataflow
scales the job down to 1 and the throughput remains the same. The job takes 30
minutes.

What I found is that the exact same pipeline, reading the same data as an
uncompressed 11 GB file from the same location, scales very well. The job takes
only 5 minutes.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
