Sounds impressive, and with the extra portability stuff, great!
Worth the switch just for the user experience improvement.

On Thu, Feb 8, 2018 at 5:52 PM, Robert Bradshaw <rober...@google.com> wrote:
> This is going to be a great improvement for our users! I'll take a
> look at the pull request.
>
> On Wed, Feb 7, 2018 at 7:03 PM, Kenneth Knowles <k...@google.com> wrote:
>> Nice!
>>
>> On Wed, Feb 7, 2018 at 6:45 PM, Charles Chen <c...@google.com> wrote:
>>>
>>> The existing DirectRunner will be needed for the foreseeable future since
>>> it is currently the only local runner that supports streaming execution.
>>>
>>>
>>> On Wed, Feb 7, 2018, 6:39 PM Pablo Estrada <pabl...@google.com> wrote:
>>>>
>>>> Very cool Charles! Have you considered whether you'll want to remove the
>>>> direct runner code afterwards?
>>>> Best
>>>> -P.
>>>>
>>>>
>>>> On Wed, Feb 7, 2018, 6:25 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>>
>>>>> That is pretty awesome.
>>>>>
>>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:
>>>>>>
>>>>>> Local execution of Beam pipelines on the Python DirectRunner currently
>>>>>> suffers from performance issues, which makes it hard for pipeline authors
>>>>>> to iterate, especially on medium to large size datasets.  We would like
>>>>>> to optimize and make this a better experience for Beam users.
>>>>>>
>>>>>>
>>>>>> The FnApiRunner was written as a way of leveraging the portability
>>>>>> framework execution code path for local portability development. We've
>>>>>> found it also provides great speedups in batch execution with no user
>>>>>> changes required, so we propose to switch to use this runner by default
>>>>>> in batch pipelines.  For example, WordCount on the Shakespeare dataset
>>>>>> with a single CPU core now takes 50 seconds to run, compared to 12
>>>>>> minutes before; this is a 15x performance improvement that users can get
>>>>>> for free, with no user pipeline changes.
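
As a quick check of the figures quoted above (12 minutes before, 50 seconds
now), the ratio works out to a little over 14x, which the message rounds to
15x. A minimal sketch of the arithmetic; the variable names are only
illustrative:

```python
# Timings as quoted in the message above, converted to seconds.
before_s = 12 * 60   # 12 minutes on the existing DirectRunner
after_s = 50         # 50 seconds on the FnApiRunner

# Speedup is the old wall-clock time divided by the new one.
speedup = before_s / after_s
print(f"{speedup:.1f}x")  # 14.4x
```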
>>>>>>
>>>>>>
>>>>>> The JIRA for this change is here
>>>>>> (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate patch
>>>>>> is available here (https://github.com/apache/beam/pull/4634). I have been
>>>>>> working over the last month on making this an automatic drop-in
>>>>>> replacement for the current DirectRunner when applicable.  Before it
>>>>>> becomes the default, you can try this runner now by manually specifying
>>>>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the runner.
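
For anyone who wants to try this, here is a minimal sketch of selecting the
runner by its fully qualified name. It assumes apache_beam is installed and
that the `--runner` option accepts the class path quoted above; the module
path may differ in other Beam releases. The helper names `pipeline_args` and
`count_words` are hypothetical and exist only for this illustration.

```python
# Runner path exactly as quoted in the message above (may differ by release).
FN_API_RUNNER = "apache_beam.runners.portability.fn_api_runner.FnApiRunner"


def pipeline_args(runner=FN_API_RUNNER):
    """Build command-line style options that select the runner by full path."""
    return ["--runner=" + runner]


def count_words(lines, argv=None):
    """Run a small in-memory word count on the selected runner."""
    # Imported lazily so pipeline_args() can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(argv if argv is not None else pipeline_args())
    # Leaving the `with` block runs the pipeline and waits for it to finish.
    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(lines)
         | beam.FlatMap(lambda line: line.split())
         | beam.combiners.Count.PerElement()
         | beam.Map(print))
```

Calling `count_words(["to be or not to be"])` should print one (word, count)
pair per distinct word. Passing a different `argv` lets you run the same
pipeline on the existing DirectRunner for comparison.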
>>>>>>
>>>>>>
>>>>>> Even with this change, local Python pipeline execution can only
>>>>>> effectively use one core because of the Python GIL.  A natural next step
>>>>>> to further improve performance will be to refactor the FnApiRunner to
>>>>>> allow for multi-process execution.  This is being tracked here
>>>>>> (https://issues.apache.org/jira/browse/BEAM-3645).
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Charles
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Got feedback? go/pabloem-feedback
>>
>>
