[
https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886245#comment-15886245
]
Ahmet Altay commented on BEAM-1442:
-----------------------------------
It is correct that python DirectRunner will utilize only one core. It actually
has a multi-threaded implementation and the executor will run the pipeline
using a thread pool. The problem is, due to GIL the threads will not run in
parallel. In order to get real parallelism the DirectRunner need to be
converted using multiple processes.
multiprocessing module is well suited for this. However, another important
thing here is that the data will be serialized across processes. This has two
implications, 1. DirectRunner needs to be updated in a way that work items are
picklable. 2. We need to pay attention to serializations costs. Perhaps using a
mix of per-procees Queue's and a master work Queue.
Another important point is that, some work needs to happen in a single process
(e.g. GroupByKey), to cover such cases there needs to be a concept of process
affinity.
> Performance improvement of the Python DirectRunner
> --------------------------------------------------
>
> Key: BEAM-1442
> URL: https://issues.apache.org/jira/browse/BEAM-1442
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py
> Reporter: Pablo Estrada
> Assignee: Ahmet Altay
> Labels: gsoc2017, mentor, python
>
> The DirectRunner for Python and Java are intended to act as policy enforcers,
> and correctness checkers for Beam pipelines; but there are users that run
> data processing tasks in them.
> Currently, the Python Direct Runner has less-than-great performance, although
> some work has gone into improving it. There are more opportunities for
> improvement.
> Skills for this project:
> * Python
> * Cython (nice to have)
> * Working through the Beam getting started materials (nice to have)
> To start figuring out this problem, it is advisable to run a simple pipeline,
> and study the `Pipeline.run` and `DirectRunner.run` methods. Ask questions
> directly on JIRA.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)