Pablo Estrada created BEAM-8012:
-----------------------------------
Summary: Perf improvements for Python WriteToBigQuery with Streaming Inserts
Key: BEAM-8012
URL: https://issues.apache.org/jira/browse/BEAM-8012
Project: Beam
Issue Type: Improvement
Components: io-py-gcp
Reporter: Pablo Estrada
Users have reported that a pipeline able to process 400 msg/sec/cpu drops to
75 msg/sec/cpu once the WriteToBigQuery sink from the Python SDK is added.
Some candidates for optimization:
* [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L776-L805]
  - The GetTable method gets called, sometimes very often.
* [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L1017-L1019]
  - The RowAsDictJsonCoder gives bytes values special treatment, and to do so
  it iterates through the whole record first.
* [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L823-L840]
  - The batching strategy for the writing DoFn may have room for improvement.
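
To make the three candidates concrete, here is a rough sketch of what each
optimization could look like. It is not the Beam implementation: the names
CachedTableLookup, encode_bytes_fields and batch_rows are hypothetical, the
fetch callable merely stands in for the real GetTable request, the list of
bytes columns is assumed to be known from the table schema, and the batch
limits are placeholder values rather than measured or documented ones.

```python
import base64
import json
from typing import Any, Callable, Dict, Iterable, Iterator, List


class CachedTableLookup:
    """Memoizes a table-metadata fetch so each table is requested only once."""

    def __init__(self, fetch: Callable[[str], Any]) -> None:
        # `fetch` is a hypothetical stand-in for the real GetTable call.
        self._fetch = fetch
        self._cache: Dict[str, Any] = {}

    def get(self, table_ref: str) -> Any:
        if table_ref not in self._cache:
            self._cache[table_ref] = self._fetch(table_ref)
        return self._cache[table_ref]


def encode_bytes_fields(
    row: Dict[str, Any], bytes_fields: Iterable[str]
) -> Dict[str, Any]:
    """Base64-encodes only the named columns instead of scanning every value.

    Assumes the caller already knows which columns hold bytes.
    """
    out = dict(row)  # shallow copy so the input row is left untouched
    for name in bytes_fields:
        value = out.get(name)
        if isinstance(value, bytes):
            out[name] = base64.b64encode(value).decode("ascii")
    return out


def batch_rows(
    rows: Iterable[Dict[str, Any]],
    max_rows: int = 500,          # placeholder limit
    max_bytes: int = 1_000_000,   # placeholder limit
) -> Iterator[List[Dict[str, Any]]]:
    """Groups rows into batches bounded by row count and approximate JSON size."""
    batch: List[Dict[str, Any]] = []
    size = 0
    for row in rows:
        row_size = len(json.dumps(row))
        # Flush before adding a row that would exceed either limit.
        if batch and (len(batch) >= max_rows or size + row_size > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(row)
        size += row_size
    if batch:
        yield batch
```

The cache trades freshness for fewer requests, so a real version would need
some way to invalidate an entry if a table is recreated or its schema
changes while the pipeline is running.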