[ 
https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886226#comment-15886226
 ] 

Haoxiang commented on BEAM-1442:
--------------------------------

Hi altay:

I ran a simple benchmark comparing Python's wordcount with Java's wordcount in 
Beam. To produce the input I wrote a small script that generates 5-10 random 
strings for every line of the data file. The script keeps a word_map; once 
map.size() > 10000, new strings are drawn from the map, which puts a baseline 
on the number of distinct words. I generated files of 1000, 10000, 100000 and 
300000 lines. The benchmark results are below:


(elapsed time)   1000 (90K)   10000 (889K)   100000 (8.7M)   300000 (26M)

python           2.389s       11.982s        110.23s         327.68s

java             6.338s       9.480s         21.846s         46.234s

My machine is a Mac pro with a 2.2 GHz Intel Core i7 and 16GB of memory. 
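
For reference, the data generator described above could look roughly like the 
sketch below. This is not the original script: the function names, the word 
length range, the lowercase alphabet and the fixed seed are all assumptions 
made for illustration. Only the 5-10 words per line and the 10000-entry 
threshold come from the description.

```python
import random
import string

MAX_VOCAB = 10000  # threshold from the description: reuse words once exceeded


def random_word(rng, min_len=3, max_len=8):
    # Word length and alphabet are assumptions; the comment does not give them.
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def generate_lines(num_lines, seed=0):
    """Yield num_lines lines of 5-10 space-separated words each."""
    rng = random.Random(seed)
    vocab = []    # list, so a random existing word can be picked quickly
    seen = set()  # set, so duplicates are not added to vocab twice
    for _ in range(num_lines):
        words = []
        for _ in range(rng.randint(5, 10)):
            if len(vocab) > MAX_VOCAB:
                # Vocabulary is full: draw from the words already generated.
                word = rng.choice(vocab)
            else:
                word = random_word(rng)
                if word not in seen:
                    seen.add(word)
                    vocab.append(word)
            words.append(word)
        yield " ".join(words)


def write_dataset(path, num_lines, seed=0):
    # Hypothetical helper: writes one generated line per row of the file.
    with open(path, "w") as f:
        for line in generate_lines(num_lines, seed):
            f.write(line + "\n")
```

Calling write_dataset("words_1000.txt", 1000) would then produce the smallest 
input file; the file name here is only an example.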

This shows a large performance gap between Python and Java on the larger 
inputs: Python is roughly 5x slower at 100000 lines and 7x slower at 300000 
lines, although it is faster on the smallest input. In the system monitor on 
my machine, the Java process uses up to 600% CPU, while the Python process 
stays at about 99% the whole time. I then looked at the source code of 
direct_runner.py: line 87 shows that the executor runs only in a background 
thread. Maybe we could change this so that it runs in multiple threads, which 
might improve the Python runner's performance a lot.
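
As a rough illustration of the multi-threaded idea, the sketch below splits the 
input into chunks, counts words in each chunk on a thread pool, and merges the 
partial counts. It is a standalone example and not DirectRunner code: the 
function names, the chunk size and the worker count are assumptions. Whether 
threads actually help would need measuring, because CPython's global 
interpreter lock lets only one thread execute Python bytecode at a time, so a 
CPU-bound job like this may need processes instead of threads to use more than 
one core.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(lines):
    # Count whitespace-separated words in one chunk of lines.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def parallel_word_count(lines, num_workers=4, chunk_size=1000):
    # Split the input into fixed-size chunks (chunk_size is an assumed value).
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each chunk is counted on a worker thread; results are merged here.
        for partial in pool.map(count_words, chunks):
            total.update(partial)
    return total
```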

> Performance improvement of the Python DirectRunner
> --------------------------------------------------
>
>                 Key: BEAM-1442
>                 URL: https://issues.apache.org/jira/browse/BEAM-1442
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py
>            Reporter: Pablo Estrada
>            Assignee: Ahmet Altay
>              Labels: gsoc2017, mentor, python
>
> The DirectRunner for Python and Java are intended to act as policy enforcers, 
> and correctness checkers for Beam pipelines; but there are users that run 
> data processing tasks in them.
> Currently, the Python Direct Runner has less-than-great performance, although 
> some work has gone into improving it. There are more opportunities for 
> improvement.
> Skills for this project:
> * Python
> * Cython (nice to have)
> * Working through the Beam getting started materials (nice to have)
> To start figuring out this problem, it is advisable to run a simple pipeline, 
> and study the `Pipeline.run` and `DirectRunner.run` methods. Ask questions 
> directly on JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
