[ 
https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858831#comment-15858831
 ] 

Pablo Estrada commented on BEAM-1442:
-------------------------------------

It would be useful for a proposal to include a benchmark so we can measure 
improvements provided by the project.
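
As a starting point, a benchmark could be as small as the sketch below. It uses 
only the standard library `timeit` module; `word_count` and `benchmark` are 
hypothetical names, and `word_count` is a stand-in workload, not Beam code. A 
real proposal would presumably time an actual pipeline on the DirectRunner.

```python
import timeit


def word_count(lines):
    """Stand-in workload (assumption); a real benchmark would run a Beam pipeline."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def benchmark(fn, data, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` runs of fn(data)."""
    # number=1 runs the workload once per measurement; taking the minimum
    # reduces noise from other activity on the machine.
    timings = timeit.repeat(lambda: fn(data), number=1, repeat=repeat)
    return min(timings)


if __name__ == "__main__":
    data = ["the quick brown fox jumps over the lazy dog"] * 10000
    print("best of 5: %.4f s" % benchmark(word_count, data))
```

Recording this number before and after a change would give a simple way to 
measure the improvement.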

After a discussion with Charles, here are some ideas that came out:
There are a few possibilities for this project. One of them is to make the code 
more efficient by adding Cython annotations, and profiling existing operations 
to make them as fast as possible.
Another is to make it work in multiple processes. The direct runner currently 
runs in a single Python process, so, because of CPython's global interpreter 
lock, only one thread can make progress at any given time. Under this idea, 
one might, for example, dedicate a process to shuffle operations.
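
To illustrate the multi-process idea, here is a rough sketch using the standard 
library `multiprocessing` module. It is not how the DirectRunner is structured; 
`extract_pairs`, `group_by_key` and `run_parallel` are hypothetical names. For 
brevity the shuffle (grouping by key) happens in the parent process here rather 
than in a dedicated shuffle process.

```python
import multiprocessing
from collections import defaultdict


def extract_pairs(line):
    """CPU-bound 'map' step (illustrative): emit (word, 1) pairs for one line."""
    return [(word, 1) for word in line.split()]


def group_by_key(pairs):
    """'Shuffle' step (illustrative): collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def run_parallel(lines, processes=2):
    """Run the map step in worker processes, then group by key in the parent."""
    # Each worker is a separate Python process with its own interpreter lock,
    # so the map step can use more than one core.
    with multiprocessing.Pool(processes=processes) as pool:
        per_line = pool.map(extract_pairs, lines)
    flat = [pair for pairs in per_line for pair in pairs]
    return group_by_key(flat)


if __name__ == "__main__":
    result = run_parallel(["a b a", "b c"])
    print({key: sum(values) for key, values in sorted(result.items())})
```

Whether this helps in practice depends on how the cost of pickling elements 
between processes compares with the work done per element, which is one more 
reason to have a benchmark.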

> Performance improvement of the Python DirectRunner
> --------------------------------------------------
>
>                 Key: BEAM-1442
>                 URL: https://issues.apache.org/jira/browse/BEAM-1442
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py
>            Reporter: Pablo Estrada
>            Assignee: Ahmet Altay
>              Labels: gsoc2017, mentor, python
>
> The DirectRunners for Python and Java are intended to act as policy enforcers 
> and correctness checkers for Beam pipelines, but some users also run 
> data processing tasks on them.
> Currently, the Python Direct Runner has less-than-great performance, although 
> some work has gone into improving it. There are more opportunities for 
> improvement.
> Skills for this project:
> * Python
> * Cython (nice to have)
> * Working through the Beam getting started materials (nice to have)
> To start figuring out this problem, it is advisable to run a simple pipeline, 
> and study the `Pipeline.run` and `DirectRunner.run` methods. Ask questions 
> directly on JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
