[
https://issues.apache.org/jira/browse/BEAM-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002132#comment-16002132
]
Anant Bhandarkar commented on BEAM-2208:
----------------------------------------
[~altay] This word count job was run yesterday:
2017-05-08_02_48_51-5929018952297525369
We tried setting the number of worker instances to 50 instead of autoscaling,
but the job only used a maximum of 2 workers and took 34 min 54 sec to execute.
What will ensure that the work is distributed among the workers, and what
accounts for such a difference in execution times compared to Java in a word
count scenario?
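For reference, a sketch of how a fixed worker count is usually requested when launching the example on Dataflow. The flag names (`--num_workers`, `--autoscaling_algorithm`) are the standard Dataflow pipeline options; the project ID and bucket paths below are placeholders, not values from this job. Note that `--num_workers` alone only sets the initial pool size; with autoscaling left enabled, the service may still scale the pool down, so disabling autoscaling is typically needed to hold the count at 50.

```shell
# Hypothetical invocation; project and gs:// paths are placeholders.
python -m apache_beam.examples.wordcount \
  --runner DataflowRunner \
  --project my-project \
  --temp_location gs://my-bucket/tmp \
  --input gs://my-bucket/input.csv \
  --output gs://my-bucket/counts \
  --num_workers 50 \
  --autoscaling_algorithm NONE
```

Whether the 50 workers are then actually kept busy depends on how well the input and intermediate data can be split, which may be part of the question here.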
> Python SDK wordcount on cloud Dataflow runner is slow
> -----------------------------------------------------
>
> Key: BEAM-2208
> URL: https://issues.apache.org/jira/browse/BEAM-2208
> Project: Beam
> Issue Type: Improvement
> Components: runner-dataflow, sdk-py
> Affects Versions: 0.6.0
> Reporter: Anant Bhandarkar
> Assignee: Ahmet Altay
> Priority: Critical
>
> I have been trying to run the Beam word count example with a 2 GB file.
> When I run the Java word count example on this CSV file, the job completes
> in about 7 min 15 sec.
> Job ID
> 2017-04-18_23_57_02-2832613177376293063
> But the word count example with the same file using the Python SDK takes
> 28 to 35 min.
> 2017-04-20_04_48_27-8924552896141769408
> SDK version
> Apache Beam SDK for Python 0.6.0
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)