Sounds impressive, and with the extra portability stuff, great! Worth the switch just for the user experience improvement.
On Thu, Feb 8, 2018 at 5:52 PM, Robert Bradshaw <rober...@google.com> wrote:
> This is going to be a great improvement for our users! I'll take a
> look at the pull request.
>
> On Wed, Feb 7, 2018 at 7:03 PM, Kenneth Knowles <k...@google.com> wrote:
>> Nice!
>>
>> On Wed, Feb 7, 2018 at 6:45 PM, Charles Chen <c...@google.com> wrote:
>>>
>>> The existing DirectRunner will be needed for the foreseeable future since
>>> it is currently the only local runner that supports streaming execution.
>>>
>>> On Wed, Feb 7, 2018, 6:39 PM Pablo Estrada <pabl...@google.com> wrote:
>>>>
>>>> Very cool Charles! Have you considered whether you'll want to remove the
>>>> direct runner code afterwards?
>>>> Best
>>>> -P.
>>>>
>>>> On Wed, Feb 7, 2018, 6:25 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>>
>>>>> That is pretty awesome.
>>>>>
>>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:
>>>>>>
>>>>>> Local execution of Beam pipelines on the Python DirectRunner currently
>>>>>> suffers from performance issues, which makes it hard for pipeline
>>>>>> authors to iterate, especially on medium to large size datasets. We
>>>>>> would like to optimize and make this a better experience for Beam users.
>>>>>>
>>>>>> The FnApiRunner was written as a way of leveraging the portability
>>>>>> framework execution code path for local portability development. We've
>>>>>> found it also provides great speedups in batch execution with no user
>>>>>> changes required, so we propose to switch to use this runner by default
>>>>>> in batch pipelines. For example, WordCount on the Shakespeare dataset
>>>>>> with a single CPU core now takes 50 seconds to run, compared to 12
>>>>>> minutes before; this is a 15x performance improvement that users can
>>>>>> get for free, with no user pipeline changes.
>>>>>>
>>>>>> The JIRA for this change is here
>>>>>> (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate patch
>>>>>> is available here (https://github.com/apache/beam/pull/4634). I have
>>>>>> been working over the last month on making this an automatic drop-in
>>>>>> replacement for the current DirectRunner when applicable. Before it
>>>>>> becomes the default, you can try this runner now by manually specifying
>>>>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the runner.
>>>>>>
>>>>>> Even with this change, local Python pipeline execution can only
>>>>>> effectively use one core because of the Python GIL. A natural next step
>>>>>> to further improve performance will be to refactor the FnApiRunner to
>>>>>> allow for multi-process execution. This is being tracked here
>>>>>> (https://issues.apache.org/jira/browse/BEAM-3645).
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Charles
>>>>
>>>> --
>>>> Got feedback? go/pabloem-feedback
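For anyone who wants to try the runner before it becomes the default, here is a minimal sketch of what "manually specifying the runner" could look like. It assumes apache_beam is installed and that FnApiRunner is importable from the module path quoted in the thread. The helper names (runner_flag, count_words) and the tiny in-memory input are my own illustration and are not taken from the pull request.

```python
# Fully qualified runner path quoted in the thread.
FN_API_RUNNER = "apache_beam.runners.portability.fn_api_runner.FnApiRunner"


def runner_flag():
    """Build the --runner command-line flag for the FnApiRunner (hypothetical helper)."""
    return "--runner=" + FN_API_RUNNER


def count_words(lines):
    """Run a small in-memory word count on the FnApiRunner (hypothetical helper).

    Assumes apache_beam is installed. The imports are deferred so that this
    module still loads on a machine without Beam.
    """
    import apache_beam as beam
    from apache_beam.runners.portability.fn_api_runner import FnApiRunner

    # Passing a runner instance selects it explicitly for this pipeline.
    with beam.Pipeline(runner=FnApiRunner()) as p:
        _ = (
            p
            | "Create" >> beam.Create(lines)
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )
```

Calling count_words(["to be or not to be"]) should print (word, count) pairs, in no particular order. For an existing pipeline script that reads its options from the command line, adding the string returned by runner_flag() to the arguments should have the same effect, though I have only sketched that here.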