Pablo Estrada created BEAM-8012:
-----------------------------------

             Summary: Perf improvements for Python WriteToBigQuery with Streaming Inserts
                 Key: BEAM-8012
                 URL: https://issues.apache.org/jira/browse/BEAM-8012
             Project: Beam
          Issue Type: Improvement
          Components: io-py-gcp
            Reporter: Pablo Estrada


Users have reported that a pipeline able to process 400 msg/sec/cpu drops to 
75 msg/sec/cpu when the WriteToBigQuery sink from the Python SDK is added.

Some candidates for optimization:
 * [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L776-L805]
   - The GetTable method gets called, sometimes very often.
 * [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L1017-L1019]
   - The RowAsDictJsonCoder gives bytes values special treatment, and to do 
     so it first iterates through the whole record.
 * [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L823-L840]
   - The batching strategy for the writing DoFn could possibly be improved.
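
To make the three candidates above more concrete, here is a rough sketch of 
what each optimization could look like in isolation. None of this is the 
actual Beam code: CachedTableLookup, encode_bytes_fields, RowBatcher and the 
callables passed to them are hypothetical names, the base64 encoding is only 
an assumed stand-in for the coder's bytes handling, and the batch size of 500 
is an arbitrary placeholder.

```python
import base64
from typing import Any, Callable, Dict, Iterable, List, Tuple

TableKey = Tuple[str, str, str]  # (project, dataset, table)


class CachedTableLookup:
    """Memoizes a table-metadata fetch so each table is fetched once."""

    def __init__(self, fetch_table: Callable[[str, str, str], Any]):
        # fetch_table is a hypothetical stand-in for the GetTable call.
        self._fetch_table = fetch_table
        self._cache: Dict[TableKey, Any] = {}
        self.fetch_count = 0

    def get(self, project: str, dataset: str, table: str) -> Any:
        key = (project, dataset, table)
        if key not in self._cache:
            self.fetch_count += 1
            self._cache[key] = self._fetch_table(project, dataset, table)
        return self._cache[key]


def encode_bytes_fields(row: Dict[str, Any],
                        bytes_fields: Iterable[str]) -> Dict[str, Any]:
    """Encodes only the fields known to hold bytes (assumed to come from the
    table schema) instead of scanning every value in the record."""
    out = dict(row)
    for name in bytes_fields:
        value = out.get(name)
        if isinstance(value, bytes):
            # base64 is an assumption made for this illustration.
            out[name] = base64.b64encode(value).decode("ascii")
    return out


class RowBatcher:
    """Buffers rows per destination and flushes when a buffer is full."""

    def __init__(self, flush: Callable[[str, List[Dict[str, Any]]], None],
                 max_batch_size: int = 500):
        # flush is a hypothetical stand-in for the streaming insert call.
        self._flush = flush
        self._max_batch_size = max_batch_size
        self._buffers: Dict[str, List[Dict[str, Any]]] = {}

    def add(self, destination: str, row: Dict[str, Any]) -> None:
        buffer = self._buffers.setdefault(destination, [])
        buffer.append(row)
        if len(buffer) >= self._max_batch_size:
            self._flush_destination(destination)

    def flush_all(self) -> None:
        # Meant to be called at the end of a bundle to drain partial batches.
        for destination in list(self._buffers):
            self._flush_destination(destination)

    def _flush_destination(self, destination: str) -> None:
        rows = self._buffers.pop(destination, [])
        if rows:
            self._flush(destination, rows)
```

Whether any of these actually helps would need to be confirmed by profiling 
the sink; the sketch only shows the shape of each idea.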



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
