Awesome! Well done, Charles.

On Thu, Feb 8, 2018 at 9:10 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
> Sounds impressive, and with the extra portability stuff, great!
> Worth the switch just for the user experience improvement.
>
> On Thu, Feb 8, 2018 at 5:52 PM, Robert Bradshaw <rober...@google.com> wrote:
> > This is going to be a great improvement for our users! I'll take a
> > look at the pull request.
> >
> > On Wed, Feb 7, 2018 at 7:03 PM, Kenneth Knowles <k...@google.com> wrote:
> >> Nice!
> >>
> >> On Wed, Feb 7, 2018 at 6:45 PM, Charles Chen <c...@google.com> wrote:
> >>>
> >>> The existing DirectRunner will be needed for the foreseeable future
> >>> since it is currently the only local runner that supports streaming
> >>> execution.
> >>>
> >>> On Wed, Feb 7, 2018, 6:39 PM Pablo Estrada <pabl...@google.com> wrote:
> >>>>
> >>>> Very cool Charles! Have you considered whether you'll want to
> >>>> remove the direct runner code afterwards?
> >>>> Best
> >>>> -P.
> >>>>
> >>>> On Wed, Feb 7, 2018, 6:25 PM Lukasz Cwik <lc...@google.com> wrote:
> >>>>>
> >>>>> That is pretty awesome.
> >>>>>
> >>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:
> >>>>>>
> >>>>>> Local execution of Beam pipelines on the Python DirectRunner
> >>>>>> currently suffers from performance issues, which makes it hard
> >>>>>> for pipeline authors to iterate, especially on medium to large
> >>>>>> size datasets. We would like to optimize and make this a better
> >>>>>> experience for Beam users.
> >>>>>>
> >>>>>> The FnApiRunner was written as a way of leveraging the
> >>>>>> portability framework execution code path for local portability
> >>>>>> development. We've found it also provides great speedups in
> >>>>>> batch execution with no user changes required, so we propose to
> >>>>>> switch to use this runner by default in batch pipelines. For
> >>>>>> example, WordCount on the Shakespeare dataset with a single CPU
> >>>>>> core now takes 50 seconds to run, compared to 12 minutes before;
> >>>>>> this is roughly a 15x performance improvement that users can get
> >>>>>> for free, with no user pipeline changes.
> >>>>>>
> >>>>>> The JIRA for this change is here
> >>>>>> (https://issues.apache.org/jira/browse/BEAM-3644), and a
> >>>>>> candidate patch is available here
> >>>>>> (https://github.com/apache/beam/pull/4634). I have been working
> >>>>>> over the last month on making this an automatic drop-in
> >>>>>> replacement for the current DirectRunner when applicable. Before
> >>>>>> it becomes the default, you can try this runner now by manually
> >>>>>> specifying
> >>>>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the
> >>>>>> runner.
> >>>>>>
> >>>>>> Even with this change, local Python pipeline execution can only
> >>>>>> effectively use one core because of the Python GIL. A natural
> >>>>>> next step to further improve performance will be to refactor the
> >>>>>> FnApiRunner to allow for multi-process execution. This is being
> >>>>>> tracked here (https://issues.apache.org/jira/browse/BEAM-3645).
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Charles
> >>>>
> >>>> --
> >>>> Got feedback? go/pabloem-feedback
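
For anyone who wants to try the runner before it becomes the default, here is a minimal sketch of what "manually specifying the runner" could look like. The runner path is the one quoted in Charles's message; the helper names (`with_fn_api_runner`, `run_word_count`) and the tiny word-count pipeline are hypothetical and only for illustration, and passing the fully qualified class name through `--runner` is an assumption about how your installed Beam version resolves runner names.

```python
# Fully qualified runner name, copied from the announcement in this thread.
FN_API_RUNNER = "apache_beam.runners.portability.fn_api_runner.FnApiRunner"


def with_fn_api_runner(argv):
    """Return a copy of argv whose runner flag points at the FnApiRunner.

    Hypothetical helper: drops any existing --runner flag (both the
    "--runner=X" and "--runner X" spellings) and appends the new one.
    """
    result = []
    skip_value = False
    for arg in argv:
        if skip_value:
            # This token is the value that followed a bare "--runner".
            skip_value = False
            continue
        if arg == "--runner":
            skip_value = True
            continue
        if arg.startswith("--runner="):
            continue
        result.append(arg)
    result.append("--runner=" + FN_API_RUNNER)
    return result


def run_word_count(lines, argv=()):
    """Count words in `lines` on the FnApiRunner and print each (word, count)."""
    # Imported lazily so the helper above is usable without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(with_fn_api_runner(list(argv)))
    with beam.Pipeline(options=options) as pipeline:
        _ = (
            pipeline
            | "Create" >> beam.Create(lines)
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )
```

Calling `run_word_count(["to be or not to be"])` should then execute the pipeline on the FnApiRunner rather than the default DirectRunner, with no other pipeline changes. Streaming pipelines would still need the existing DirectRunner, as noted above.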