Local execution of Beam pipelines on the Python DirectRunner currently
suffers from poor performance, which makes it hard for pipeline authors
to iterate, especially on medium-to-large datasets.  We would like to
make local execution a better experience for Beam users.

The FnApiRunner was written as a way of exercising the portability
framework's execution code path during local portability development. We've
found it also provides great speedups in batch execution, so we propose to
make it the default runner for batch pipelines.  For example, WordCount on
the Shakespeare dataset on a single CPU core now takes 50 seconds, compared
to 12 minutes before: roughly a 15x performance improvement that users get
for free, with no changes to their pipelines.

The JIRA for this change is https://issues.apache.org/jira/browse/BEAM-3644,
and a candidate patch is available at
https://github.com/apache/beam/pull/4634. I have been working over the last
month on making this an automatic drop-in replacement for the current
DirectRunner when applicable.  Before it becomes the default, you can try
this runner now by manually specifying
apache_beam.runners.portability.fn_api_runner.FnApiRunner as the runner.

Even with this change, local Python pipeline execution can only effectively
use one core because of the Python GIL.  A natural next step to further
improve performance is to refactor the FnApiRunner to allow multi-process
execution.  This is tracked in
https://issues.apache.org/jira/browse/BEAM-3645.

Best,

Charles
