Very cool, Charles! Have you considered whether you'll want to remove the
direct runner code afterwards?
Best,
-P.

On Wed, Feb 7, 2018, 6:25 PM Lukasz Cwik <lc...@google.com> wrote:

> That is pretty awesome.
>
> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:
>
>> Local execution of Beam pipelines on the Python DirectRunner currently
>> suffers from performance issues, which makes it hard for pipeline authors
>> to iterate, especially on medium to large datasets. We would like to
>> optimize this and make it a better experience for Beam users.
>>
>> The FnApiRunner was written as a way of leveraging the portability
>> framework execution code path for local portability development. We've
>> found it also provides great speedups in batch execution, so we propose
>> to make this runner the default for batch pipelines. For example,
>> WordCount on the Shakespeare dataset with a single CPU core now takes
>> 50 seconds to run, compared to 12 minutes before; this is roughly a 14x
>> performance improvement that users can get for free, with no user
>> pipeline changes.
>>
>> The JIRA for this change is here (
>> https://issues.apache.org/jira/browse/BEAM-3644), and a candidate patch
>> is available here (https://github.com/apache/beam/pull/4634). I have
>> been working over the last month on making this an automatic drop-in
>> replacement for the current DirectRunner when applicable.  Before it
>> becomes the default, you can try this runner now by manually specifying
>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the runner.
>>
>> Even with this change, local Python pipeline execution can effectively
>> use only one core because of the Python GIL. A natural next step to
>> further improve performance will be to refactor the FnApiRunner to allow
>> for multi-process execution. This is being tracked here (
>> https://issues.apache.org/jira/browse/BEAM-3645).
>>
>> Best,
>>
>> Charles
>>
>
>
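
For anyone who wants to try the runner by hand as described in the quoted message, here is a rough sketch of what specifying it manually might look like. The dotted runner path is taken from Charles's email; the helper names (split_runner_path, load_runner, run_wordcount) and the tiny word-count pipeline are made up for illustration and are not part of Beam or of the candidate patch. It assumes apache_beam is installed and that the module path is still valid in your Beam version.

```python
import importlib

# Dotted path to the runner class, as quoted in the message above.
RUNNER_PATH = "apache_beam.runners.portability.fn_api_runner.FnApiRunner"


def split_runner_path(path):
    """Split 'pkg.module.ClassName' into ('pkg.module', 'ClassName')."""
    module_name, _, class_name = path.rpartition(".")
    return module_name, class_name


def load_runner(path=RUNNER_PATH):
    """Import the runner class named by `path` and return an instance.

    Hypothetical helper: requires apache_beam to be installed, and the
    import only happens when this function is called.
    """
    module_name, class_name = split_runner_path(path)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()


def run_wordcount(lines, runner_path=RUNNER_PATH):
    """Count words in `lines` on the chosen runner and print (word, count)."""
    # Deferred import so the helpers above can be used without Beam.
    import apache_beam as beam

    # Pipeline accepts a runner instance; leaving the `with` block runs
    # the pipeline and waits for it to finish.
    with beam.Pipeline(runner=load_runner(runner_path)) as p:
        (
            p
            | "Create" >> beam.Create(lines)
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )
```

With Beam installed, calling run_wordcount(["to be or not to be"]) should run the small pipeline on the FnApiRunner instead of the default runner. The order of the printed pairs is not guaranteed.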

-- 
Got feedback? go/pabloem-feedback
