[jira] [Commented] (BEAM-8012) Perf improvements for Python WriteToBigQuery with Streaming Inserts

Pablo Estrada (Jira) Tue, 20 Aug 2019 08:45:22 -0700


    [ 
https://issues.apache.org/jira/browse/BEAM-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911475#comment-16911475
 ]


Pablo Estrada commented on BEAM-8012:
-------------------------------------

Assigning to Tanay to take a look.

> Perf improvements for Python WriteToBigQuery with Streaming Inserts
> -------------------------------------------------------------------
>
>                 Key: BEAM-8012
>                 URL: https://issues.apache.org/jira/browse/BEAM-8012
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-py-gcp
>            Reporter: Pablo Estrada
>            Assignee: Tanay Tummalapalli
>            Priority: Major
>
> Users have reported that a pipeline able to process 400 msg/sec/cpu 
> drops to 75 msg/sec/cpu when the WriteToBigQuery sink from the Python 
> SDK is added.
> Some candidates to be optimized:
>  * 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L776-L805]
>  - The GetTable method gets called, sometimes very often.
>  * 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L1017-L1019]
>  - The RowAsDictJsonCoder does special treatment of bytes, and for that it 
> iterates through the whole record first.
>  * 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L823-L840]
>  - The batching strategy for the Writing DoFn may be improved?
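
The three candidates listed in the issue can be illustrated with a minimal, standalone sketch. Nothing here is Beam's actual code or a proposed patch: CachedTableLookup, RowBatcher, encode_row, the fetch_table and flush callbacks, and the max_batch_size default are all hypothetical names and values chosen for illustration. The sketch assumes that table metadata does not change during a bundle, that rows can be buffered per destination table, and that bytes values are meant to be base64-encoded in the JSON payload.

```python
import base64
import json
from typing import Callable, Dict, List


class CachedTableLookup:
    """Memoizes a table-metadata fetch so each table is fetched only once."""

    def __init__(self, fetch_table: Callable[[str], dict]):
        # fetch_table is a hypothetical stand-in for a GetTable-style call.
        self._fetch_table = fetch_table
        self._cache: Dict[str, dict] = {}
        self.fetch_count = 0

    def get(self, table_ref: str) -> dict:
        if table_ref not in self._cache:
            self.fetch_count += 1
            self._cache[table_ref] = self._fetch_table(table_ref)
        return self._cache[table_ref]


def _encode_bytes(value):
    # json.dumps calls this hook only for values it cannot serialize itself,
    # so rows without bytes never pay for an extra pass over the record.
    if isinstance(value, bytes):
        return base64.b64encode(value).decode("ascii")
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def encode_row(row: dict) -> str:
    """Serializes a row to JSON, converting bytes values lazily."""
    return json.dumps(row, default=_encode_bytes)


class RowBatcher:
    """Buffers rows per destination and flushes each buffer when it fills."""

    def __init__(self, flush: Callable[[str, List[dict]], None],
                 max_batch_size: int = 500):
        # flush is a hypothetical stand-in for a streaming-insert request.
        self._flush = flush
        self._max_batch_size = max_batch_size
        self._buffers: Dict[str, List[dict]] = {}

    def add(self, table_ref: str, row: dict) -> None:
        buffer = self._buffers.setdefault(table_ref, [])
        buffer.append(row)
        if len(buffer) >= self._max_batch_size:
            self._flush(table_ref, buffer)
            self._buffers[table_ref] = []

    def flush_all(self) -> None:
        # Intended to run at the end of a bundle so no rows are left behind.
        for table_ref, buffer in self._buffers.items():
            if buffer:
                self._flush(table_ref, buffer)
        self._buffers = {}
```

With these pieces, repeated lookups for the same table cost one fetch, and rows reach the flush callback in groups of at most max_batch_size per table. Whether any of this maps cleanly onto the Writing DoFn depends on details of bigquery.py that the issue does not spell out.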



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
