[jira] [Commented] (BEAM-6064) Python BigQuery performance much worse than Java

Javier Domingo Cansino (JIRA) Fri, 01 Feb 2019 02:19:24 -0800


    [ 
https://issues.apache.org/jira/browse/BEAM-6064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758175#comment-16758175
 ]


Javier Domingo Cansino commented on BEAM-6064:
----------------------------------------------

I have been unable to activate the experiment successfully. As you can see at 
the bottom right of the screenshot, the experiment appears to be activated, but 
there is no performance improvement. The Java version runs the same read in 
2min44s of CPU time. 

!Screenshot from 2019-02-01 10-10-45.png! 

Is there any extra step besides being on version 2.9.0 and adding use_fastavro 
to experiments?
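
For reference, a minimal sketch of one way the flag can be passed from Python, assuming the pipeline is configured through PipelineOptions. The helper with_experiment is a hypothetical name introduced here and is not part of Beam. The sketch only checks that the flag is parsed locally; it does not show whether the Dataflow workers actually honor it. The Beam part is skipped if apache_beam is not installed.

```python
def with_experiment(argv, experiment):
    """Return a copy of argv with --experiments=<experiment> appended once.

    Hypothetical helper for illustration; not part of the Beam SDK.
    """
    flag = "--experiments=" + experiment
    args = list(argv)
    if flag not in args:
        args.append(flag)
    return args


# Assumed base options; a real job also needs project, temp location, etc.
pipeline_args = with_experiment(["--runner=DataflowRunner"], "use_fastavro")

try:
    from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions
except ImportError:
    # Beam is not installed, so the local check below cannot run.
    parsed_experiments = None
else:
    options = PipelineOptions(pipeline_args)
    # DebugOptions collects every --experiments value into a list.
    parsed_experiments = options.view_as(DebugOptions).experiments

print(pipeline_args)
print(parsed_experiments)
```

If use_fastavro shows up in the parsed list, the flag at least reached the pipeline options on the submitting side.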

> Python BigQuery performance much worse than Java
> ------------------------------------------------
>
>                 Key: BEAM-6064
>                 URL: https://issues.apache.org/jira/browse/BEAM-6064
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>    Affects Versions: 2.8.0
>            Reporter: Jan Kuipers
>            Assignee: Chamikara Jayalath
>            Priority: Major
>             Fix For: 2.9.0
>
>         Attachments: Screenshot from 2019-02-01 10-10-45.png, 
> results-java.png, results-python.png
>
>
> Reading from BigQuery seems to be much slower in Python than in Java.
> To reproduce this, I've run the following two programs on Google Cloud. 
> Both read the weights from the public data set "natality" and output the 
> top 100 largest weights.
> Python:
> {code:python}
> # <cut imports>
> options = PipelineOptions()
> options.view_as(StandardOptions).runner = 'DataflowRunner'
> # <cut more options>
> pipeline = Pipeline(options=options)
> (pipeline
>     | 'Read' >> beam.io.Read(beam.io.BigQuerySource(
>         query='SELECT weight_pounds FROM [bigquery-public-data:samples.natality]'))
>     | 'MapToFloat' >> beam.Map(lambda elem: elem['weight_pounds'])
>     | 'Top' >> beam.combiners.Top.Largest(100)
>     | 'MapToString' >> beam.Map(lambda elem: str(elem))
>     | 'Write' >> beam.io.WriteToText("<output-file>"))
> pipeline.run()
> {code}
>  Java:
> {code:java}
> // <cut imports>
> public class Natality {
>     public static void main(String[] args) {
>         DataflowPipelineOptions options =
>             PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
>         options.setRunner(DataflowRunner.class);
>         // <cut more options>
>         
>         Pipeline pipeline = Pipeline.create(options);
>         pipeline.apply("Read", BigQueryIO.readTableRows()
>             .fromQuery("SELECT weight_pounds FROM [bigquery-public-data:samples.natality]"))
>             .apply("MapToDouble", MapElements
>                 .into(TypeDescriptors.doubles())
>                 .via(row -> {
>                      Object obj = row.get("weight_pounds");
>                      return (obj == null ? 0.0 : (Double) obj);
>                 }))
>             .apply("Top", Top.largest(100))
>             .apply("MapToString", MapElements
>                 .into(TypeDescriptors.strings())
>                 .via(weight -> weight.toString()))
>             .apply("Write", TextIO.write().to("<output-file>"));
>         pipeline.run().waitUntilFinish();
>     }
> }
> {code}
> The "<cut more options>" are basic options like project, job name, temp 
> location, etc. Both programs produce identical outputs.
> Running these programs launches a Dataflow job on Google Cloud with the 
> following results (data from the Google Cloud Platform web interface; 
> screenshots attached).
> Python:
> {noformat}
> Read Succeeded 1 hr 40 min 40 sec
> MapToFloat Succeeded 2 min 43 sec
> Top Succeeded 5 min 25 sec
> MapToString Succeeded 0 sec
> Write Succeeded 3 sec
> {noformat}
> Java:
> {noformat}
> Read Succeeded 4 min 45 sec
> MapToDouble Succeeded 45 sec
> Top Succeeded 52 sec
> MapToString Succeeded 0 sec
> Write Succeeded 1 sec
> {noformat}
> As you can see, there is an enormous performance hit in Python when reading 
> from BigQuery: 1h40m versus less than 5 minutes.
> Furthermore, the other standard operations (like Top) are also much slower 
> in Python than in Java.
>  
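
To put numbers on the gap, the step timings listed above can be converted to seconds and compared. This is a small sketch using only the figures reported in the issue; to_seconds is a helper introduced here for illustration, and the step labelled "Map" pairs MapToFloat (Python) with MapToDouble (Java), since the names differ between the two pipelines.

```python
import re

UNIT_SECONDS = {"hr": 3600, "min": 60, "sec": 1}


def to_seconds(duration):
    """Convert a string such as '1 hr 40 min 40 sec' to whole seconds."""
    parts = re.findall(r"(\d+)\s*(hr|min|sec)", duration)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


# Timings copied from the Dataflow results quoted in the issue.
python_steps = {"Read": "1 hr 40 min 40 sec", "Map": "2 min 43 sec", "Top": "5 min 25 sec"}
java_steps = {"Read": "4 min 45 sec", "Map": "45 sec", "Top": "52 sec"}

# Ratio of Python time to Java time for each step.
slowdown = {
    step: to_seconds(python_steps[step]) / to_seconds(java_steps[step])
    for step in python_steps
}

for step, ratio in slowdown.items():
    print(f"{step}: Python took {ratio:.1f}x as long as Java")
```

On these figures the Read step takes roughly 21 times as long in Python as in Java, and Top roughly 6 times as long.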



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
