Nice!

On Wed, Feb 7, 2018 at 6:45 PM, Charles Chen <c...@google.com> wrote:

> The existing DirectRunner will be needed for the foreseeable future since
> it is currently the only local runner that supports streaming execution.
>
> On Wed, Feb 7, 2018, 6:39 PM Pablo Estrada <pabl...@google.com> wrote:
>
>> Very cool Charles! Have you considered whether you'll want to remove the
>> direct runner code afterwards?
>> Best
>> -P.
>>
>> On Wed, Feb 7, 2018, 6:25 PM Lukasz Cwik <lc...@google.com> wrote:
>>
>>> That is pretty awesome.
>>>
>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:
>>>
>>>> Local execution of Beam pipelines on the Python DirectRunner currently
>>>> suffers from performance issues, which makes it hard for pipeline authors
>>>> to iterate, especially on medium to large size datasets.  We would like to
>>>> optimize and make this a better experience for Beam users.
>>>>
>>>> The FnApiRunner was written as a way of leveraging the portability
>>>> framework execution code path for local portability development. We've
>>>> found it also provides large speedups in batch execution, so we propose
>>>> making it the default runner for batch pipelines.  For example, WordCount
>>>> on the Shakespeare dataset with a single CPU core now takes 50 seconds to
>>>> run, compared to 12 minutes before; this is a roughly 14x performance
>>>> improvement that users get for free, with no pipeline changes.
>>>>
>>>> The JIRA for this change is here
>>>> (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate patch
>>>> is available here (https://github.com/apache/beam/pull/4634). I have been
>>>> working over the last month on making this an automatic drop-in
>>>> replacement for the current DirectRunner when applicable.  Before it
>>>> becomes the default, you can try this runner now by manually specifying
>>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner
>>>> as the runner.
>>>>
>>>> Even with this change, local Python pipeline execution can only
>>>> effectively use one core because of the Python GIL.  A natural next step to
>>>> further improve performance will be to refactor the FnApiRunner to allow
>>>> for multi-process execution.  This is being tracked here (
>>>> https://issues.apache.org/jira/browse/BEAM-3645).
>>>>
>>>> Best,
>>>>
>>>> Charles
>>>>
>>>
>>>
>>
>> --
>> Got feedback? go/pabloem-feedback
>> <https://goto.google.com/pabloem-feedback>
>>
>
