[
https://issues.apache.org/jira/browse/BEAM-6064?focusedWorklogId=467565&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467565
]
ASF GitHub Bot logged work on BEAM-6064:
----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Aug/20 21:00
Start Date: 06/Aug/20 21:00
Worklog Time Spent: 10m
Work Description: pabloem opened a new pull request #12485:
URL: https://github.com/apache/beam/pull/12485
Consider results from this analysis:
https://docs.google.com/document/d/1s8VRkN4qKdgGkDOZQiwowmD3GVyVV9UskJQwTdKfRCE/edit#
This change helps increase EPS per worker from ~100 to >1000 per worker.
Also improves documentation and telemetry.
------------------------
Thank you for your contribution! Follow this checklist to help us
incorporate your contribution quickly and easily:
- [ ] [**Choose
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA
issue, if applicable. This will automatically link the pull request to the
issue.
- [ ] Update `CHANGES.md` with noteworthy changes.
- [ ] If this contribution is large, please file an Apache [Individual
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
See the [Contributor Guide](https://beam.apache.org/contribute) for more
tips on [how to make review process
smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
Post-Commit Tests Status (on master branch)
------------------------------------------------------------------------------------------------
Lang | SDK | Dataflow | Flink | Samza | Spark | Twister2
--- | --- | --- | --- | --- | --- | ---
Go | [](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
| --- | [](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/)
| --- | [](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
| ---
Java | [](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
| [](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/)
| [](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
| [](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
| [](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Twister2/lastCompletedBuild/)
Python | [](https://ci-beam.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PostCommit_Python38/lastCompletedBuild/)
| [](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/)
| [](https://ci-beam.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/)
| --- | [](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/)
| ---
XLang | [](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/lastCompletedBuild/)
| --- | [](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/)
| --- | [](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/)
| ---
Pre-Commit Tests Status (on master branch)
------------------------------------------------------------------------------------------------
--- |Java | Python | Go | Website
--- | --- | --- | --- | ---
Non-portable | [](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/)
| [](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/)<br>[](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocker_Cron/lastCompletedBuild/)
| [](https://ci-beam.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/)
| [](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/)
Portable | --- | [](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/)
| --- | ---
See
[.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md)
for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
------------------------------------------------------------------------------------------------

See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more
information about GitHub Actions CI.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 467565)
Remaining Estimate: 0h
Time Spent: 10m
> Python BigQuery performance much worse than Java
> ------------------------------------------------
>
> Key: BEAM-6064
> URL: https://issues.apache.org/jira/browse/BEAM-6064
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Affects Versions: 2.8.0
> Reporter: Jan Kuipers
> Assignee: Pablo Estrada
> Priority: P2
> Attachments: Screenshot from 2019-02-01 10-10-45.png,
> results-java.png, results-python.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The performance of reading from BigQuery in Python seems to be much worse
> than the performance of it in Java.
> To reproduce this, I've run the following two programs on the Google Cloud,
> which basically read the weights from the public data set "natality" and
> outputs the top 100 largest weights.
> Python:
> {code:java}
> # <cut imports>
> options = PipelineOptions()
> options.view_as(StandardOptions).runner = 'DataflowRunner'
> # <cut more options>
> pipeline = Pipeline(options=options)
> (pipeline
> | 'Read' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT
> weight_pounds FROM [bigquery-public-data:samples.natality]'))
> | 'MapToFloat' >> beam.Map(lambda elem: elem['weight_pounds'])
> | 'Top' >> beam.combiners.Top.Largest(100)
> | 'MapToString' >> beam.Map(lambda elem: str(elem))
> | 'Write' >> beam.io.WriteToText("<output-file>"))
> pipeline.run()
> {code}
> Java:
> {code:java}
> // <cut imports>
> public class Natality {
> public static void main(String[] args) {
> DataflowPipelineOptions options =
> PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
> options.setRunner(DataflowRunner.class);
> // <cut more options>
>
> Pipeline pipeline = Pipeline.create(options);
> pipeline.apply("Read", BigQueryIO.readTableRows()
> .fromQuery("SELECT weight_pounds FROM
> [bigquery-public-data:samples.natality]"))
> .apply("MapToDouble", MapElements
> .into(TypeDescriptors.doubles())
> .via(row -> {
> Object obj = row.get("weight_pounds");
> return (obj == null ? 0.0 : (Double) obj);
> }))
> .apply("Top", Top.largest(100))
> .apply("MapToString", MapElements
> .into(TypeDescriptors.strings())
> .via(weight -> weight.toString()))
> .apply("Write", TextIO.write().to("<output-file>"));
> pipeline.run().waitUntilFinish();
> }
> }
> {code}
> The "<cut more options>" are basic options like project, job name, temp
> location, etc. Both programs produce identical outputs.
> Running these programs launches a DataFlow job on the Google Cloud with the
> following results (data from the Google Cloud Platform web interface;
> screenshots attached).
> Python:
> {noformat}
> Read Succeeded 1 hr 40 min 40 sec
> MapToFloat Succeeded 2 min 43 sec
> Top Succeeded 5 min 25 sec
> MapToString Succeeded 0 sec
> Write Succeeded 3 sec{noformat}
> Java:
> {noformat}
> Read Succeeded 4 min 45 sec
> MapToDouble Succeeded 45 sec
> Top Succeeded 52 sec
> MapToString Succeeded 0 sec
> Write Succeeded 1 sec
> {noformat}
> As you can see, there is an enormous performance hit in Python w.r.t. the
> reading from BigQuery: 1h40m vs less than 5 minutes.
> Furthermore the other standard operations (like Top) are also much slower in
> Python than in Java.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)