This is going to be a great improvement for our users! I'll take a
look at the pull request.

On Wed, Feb 7, 2018 at 7:03 PM, Kenneth Knowles <k...@google.com> wrote:
> Nice!
>
> On Wed, Feb 7, 2018 at 6:45 PM, Charles Chen <c...@google.com> wrote:
>>
>> The existing DirectRunner will be needed for the foreseeable future since
>> it is currently the only local runner that supports streaming execution.
>>
>>
>> On Wed, Feb 7, 2018, 6:39 PM Pablo Estrada <pabl...@google.com> wrote:
>>>
>>> Very cool Charles! Have you considered whether you'll want to remove the
>>> direct runner code afterwards?
>>> Best
>>> -P.
>>>
>>>
>>> On Wed, Feb 7, 2018, 6:25 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>> That is pretty awesome.
>>>>
>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:
>>>>>
>>>>> Local execution of Beam pipelines on the Python DirectRunner currently
>>>>> suffers from performance issues, which makes it hard for pipeline
>>>>> authors to iterate, especially on medium to large datasets.  We would
>>>>> like to optimize this and make it a better experience for Beam users.
>>>>>
>>>>>
>>>>> The FnApiRunner was written as a way of leveraging the portability
>>>>> framework execution code path for local portability development.  We've
>>>>> found it also provides great speedups in batch execution with no user
>>>>> changes required, so we propose switching to this runner by default for
>>>>> batch pipelines.  For example, WordCount on the Shakespeare dataset with
>>>>> a single CPU core now takes 50 seconds to run, compared to 12 minutes
>>>>> before; this is roughly a 15x performance improvement that users can get
>>>>> for free, with no pipeline changes.
>>>>>
>>>>>
>>>>> The JIRA for this change is here
>>>>> (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate patch
>>>>> is available here (https://github.com/apache/beam/pull/4634).  I have
>>>>> been working over the last month on making this an automatic drop-in
>>>>> replacement for the current DirectRunner when applicable.  Before it
>>>>> becomes the default, you can try this runner now by manually specifying
>>>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the runner.
>>>>>
>>>>>
>>>>> Even with this change, local Python pipeline execution can only
>>>>> effectively use one core because of the Python GIL.  A natural next step
>>>>> to further improve performance will be to refactor the FnApiRunner to
>>>>> allow for multi-process execution.  This is being tracked here
>>>>> (https://issues.apache.org/jira/browse/BEAM-3645).
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Charles
>>>>
>>>>
>>>
>>>
>>> --
>>> Got feedback? go/pabloem-feedback
>
>
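For anyone who wants to try the manual opt-in described in the original message, here is a minimal sketch of what "specifying the runner" could look like. It assumes the `--runner` pipeline option accepts a fully qualified class name, which is how the thread words it. The helper names `runner_flag` and `run_word_count` and the tiny in-memory input are made up for illustration and are not part of Beam or of the candidate patch.

```python
# Fully qualified runner class name, quoted from the thread above.
FN_API_RUNNER = "apache_beam.runners.portability.fn_api_runner.FnApiRunner"


def runner_flag(runner=FN_API_RUNNER):
    """Return the command-line flag that selects `runner` (hypothetical helper)."""
    return "--runner=" + runner


def run_word_count(lines, argv=None):
    """Count words in `lines` with a small batch pipeline and print the counts."""
    # Imported lazily so that runner_flag() is usable without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Assumption: --runner accepts a fully qualified class name.
    options = PipelineOptions(argv if argv is not None else [runner_flag()])
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(lines)
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "Count" >> beam.CombinePerKey(sum)
            | "Print" >> beam.Map(print)
        )


if __name__ == "__main__":
    run_word_count(["to be or not to be"])
```

The same flag could be passed on the command line of an existing batch pipeline script instead of being built in code, so the pipeline itself would not need to change.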
