[ 
https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886245#comment-15886245
 ] 

Ahmet Altay commented on BEAM-1442:
-----------------------------------

It is correct that python DirectRunner will utilize only one core. It actually 
has a multi-threaded implementation and the executor will run the pipeline 
using a thread pool. The problem is, due to GIL the threads will not run in 
parallel. In order to get real parallelism the DirectRunner need to be 
converted using multiple processes. 

multiprocessing module is well suited for this. However, another important 
thing here is that the data will be serialized across processes. This has two 
implications, 1. DirectRunner needs to be updated in a way that work items are 
picklable. 2. We need to pay attention to serializations costs. Perhaps using a 
mix of per-procees Queue's and a master work Queue. 

Another important point is that, some work needs to happen in a single process 
(e.g. GroupByKey), to cover such cases there needs to be a concept of process 
affinity.

> Performance improvement of the Python DirectRunner
> --------------------------------------------------
>
>                 Key: BEAM-1442
>                 URL: https://issues.apache.org/jira/browse/BEAM-1442
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py
>            Reporter: Pablo Estrada
>            Assignee: Ahmet Altay
>              Labels: gsoc2017, mentor, python
>
> The DirectRunner for Python and Java are intended to act as policy enforcers, 
> and correctness checkers for Beam pipelines; but there are users that run 
> data processing tasks in them.
> Currently, the Python Direct Runner has less-than-great performance, although 
> some work has gone into improving it. There are more opportunities for 
> improvement.
> Skills for this project:
> * Python
> * Cython (nice to have)
> * Working through the Beam getting started materials (nice to have)
> To start figuring out this problem, it is advisable to run a simple pipeline, 
> and study the `Pipeline.run` and `DirectRunner.run` methods. Ask questions 
> directly on JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to