[
https://issues.apache.org/jira/browse/BEAM-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003170#comment-16003170
]
Ahmet Altay commented on BEAM-2208:
-----------------------------------
Thank you [[email protected]].
From the linked job, a few things stand out:
- Autoscaling cannot scale beyond 8 workers. This might be a quota issue on
your side.
- max_num_workers is not set. When it is not set, autoscaling is capped at
15 workers. (You are not hitting this cap because of the quota issue above.)
- It is possible that there is a hot key, which is adding to the execution time.
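The interaction of the first two points can be sketched as follows. This is an illustrative model only, not Dataflow's actual autoscaling code; the only facts taken from the comment are the 15-worker default cap when max_num_workers is unset and the 8-worker quota limit seen in the linked job:

```python
# Illustrative sketch of the worker-cap behavior described above:
# the worker count autoscaling can reach is bounded by both the
# project's worker quota and max_num_workers, which the Dataflow
# runner defaults to 15 when the flag is not set.
DEFAULT_MAX_NUM_WORKERS = 15  # default autoscaling cap when unset

def effective_workers(desired, quota, max_num_workers=None):
    """Return the worker count autoscaling can actually reach."""
    cap = max_num_workers if max_num_workers is not None else DEFAULT_MAX_NUM_WORKERS
    return min(desired, quota, cap)

# With an 8-worker quota and max_num_workers unset, scaling stops
# at 8, as observed in the linked job.
print(effective_workers(desired=40, quota=8))    # 8: quota binds first
print(effective_workers(desired=40, quota=100))  # 15: default cap binds
print(effective_workers(desired=40, quota=100, max_num_workers=50))  # 40
```

So raising max_num_workers alone would not help here; the quota is the binding limit.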
It would be most helpful if I could reproduce this case. From the title of the
issue I assume you are running the wordcount example as is. Would it be
possible for you to share your input file (if it contains only dummy
information)? Otherwise, could you check your quota and try running with
another input file?
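For reference, a re-run along the suggested lines would look roughly like the following. This is a config fragment with placeholder values (bucket, project, and input path are not taken from the job above), and the flag names follow the current Beam wordcount example; they may differ slightly in SDK 0.6.0:

```shell
# Hypothetical re-run of the Python wordcount example on Dataflow.
# All gs:// paths and the project ID are placeholders.
# max_num_workers is set explicitly so the default 15-worker cap
# does not apply (the quota must still allow that many workers).
python -m apache_beam.examples.wordcount \
  --input gs://YOUR_BUCKET/your-input.csv \
  --output gs://YOUR_BUCKET/counts \
  --runner DataflowRunner \
  --project YOUR_PROJECT \
  --temp_location gs://YOUR_BUCKET/tmp \
  --max_num_workers 50
```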
> Python SDK wordcount on cloud Dataflow runner is slow
> -----------------------------------------------------
>
> Key: BEAM-2208
> URL: https://issues.apache.org/jira/browse/BEAM-2208
> Project: Beam
> Issue Type: Improvement
> Components: runner-dataflow, sdk-py
> Affects Versions: 0.6.0
> Reporter: Anant Bhandarkar
> Assignee: Ahmet Altay
> Priority: Critical
>
> I have been trying to run the Beam wordcount example with a 2GB file.
> When I run the Java wordcount example on this CSV file, the job
> completes in 7.15 minutes.
> Job ID
> 2017-04-18_23_57_02-2832613177376293063
> But the wordcount example with the same file using the Python SDK takes 28 to 35 minutes.
> 2017-04-20_04_48_27-8924552896141769408
> SDK version
> Apache Beam SDK for Python 0.6.0
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)