That is pretty awesome.

On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:

> Local execution of Beam pipelines on the Python DirectRunner currently
> suffers from performance issues, which makes it hard for pipeline authors
> to iterate, especially on medium- to large-sized datasets.  We would like to
> optimize and make this a better experience for Beam users.
>
> The FnApiRunner was written as a way of leveraging the portability
> framework execution code path for local portability development. We've
> found it also provides great speedups in batch execution with no user
> changes required, so we propose to make this runner the default for
> batch pipelines.  For example, WordCount on the Shakespeare dataset with a
> single CPU core now takes 50 seconds to run, compared to 12 minutes
> before: roughly a 15x speedup that users get with no pipeline changes.
>
> The JIRA for this change is here
> (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate patch
> is available here (https://github.com/apache/beam/pull/4634). I have been
> working over the last month on making this an automatic drop-in
> replacement for the current DirectRunner when applicable.  Before it
> becomes the default, you can try this runner now by manually specifying
> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the runner.
>
> Even with this change, local Python pipeline execution can only
> effectively use one core because of the Python GIL.  A natural next step to
> further improve performance will be to refactor the FnApiRunner to allow
> for multi-process execution.  This is being tracked here
> (https://issues.apache.org/jira/browse/BEAM-3645).
>
> Best,
>
> Charles
>
