Awesome! Well done, Charles.

On Thu, Feb 8, 2018 at 9:10 AM, Ismaël Mejía <ieme...@gmail.com> wrote:

> Sounds impressive, and with the extra portability stuff, great!
> Worth the switch just for the user experience improvement.
>
> On Thu, Feb 8, 2018 at 5:52 PM, Robert Bradshaw <rober...@google.com>
> wrote:
> > This is going to be a great improvement for our users! I'll take a
> > look at the pull request.
> >
> > On Wed, Feb 7, 2018 at 7:03 PM, Kenneth Knowles <k...@google.com> wrote:
> >> Nice!
> >>
> >> On Wed, Feb 7, 2018 at 6:45 PM, Charles Chen <c...@google.com> wrote:
> >>>
> >>> The existing DirectRunner will be needed for the foreseeable future
> >>> since it is currently the only local runner that supports streaming
> >>> execution.
> >>>
> >>>
> >>> On Wed, Feb 7, 2018, 6:39 PM Pablo Estrada <pabl...@google.com> wrote:
> >>>>
> >>>> Very cool Charles! Have you considered whether you'll want to remove
> >>>> the direct runner code afterwards?
> >>>> Best
> >>>> -P.
> >>>>
> >>>>
> >>>> On Wed, Feb 7, 2018, 6:25 PM Lukasz Cwik <lc...@google.com> wrote:
> >>>>>
> >>>>> That is pretty awesome.
> >>>>>
> >>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:
> >>>>>>
> >>>>>> Local execution of Beam pipelines on the Python DirectRunner
> >>>>>> currently suffers from performance issues, which makes it hard for
> >>>>>> pipeline authors to iterate, especially on medium to large size
> >>>>>> datasets.  We would like to optimize and make this a better
> >>>>>> experience for Beam users.
> >>>>>>
> >>>>>>
> >>>>>> The FnApiRunner was written as a way of leveraging the portability
> >>>>>> framework execution code path for local portability development.
> >>>>>> We've found it also provides great speedups in batch execution with
> >>>>>> no user changes required, so we propose switching to this runner by
> >>>>>> default in batch pipelines.  For example, WordCount on the
> >>>>>> Shakespeare dataset with a single CPU core now takes 50 seconds to
> >>>>>> run, compared to 12 minutes before; this is a roughly 14x
> >>>>>> performance improvement that users can get for free, with no user
> >>>>>> pipeline changes.
> >>>>>>
> >>>>>>
> >>>>>> The JIRA for this change is here
> >>>>>> (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate
> >>>>>> patch is available here (https://github.com/apache/beam/pull/4634).
> >>>>>> I have been working over the last month on making this an automatic
> >>>>>> drop-in replacement for the current DirectRunner when applicable.
> >>>>>> Before it becomes the default, you can try this runner now by
> >>>>>> manually specifying
> >>>>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the
> >>>>>> runner.
> >>>>>>
> >>>>>>
> >>>>>> Even with this change, local Python pipeline execution can only
> >>>>>> effectively use one core because of the Python GIL.  A natural next
> >>>>>> step to further improve performance will be to refactor the
> >>>>>> FnApiRunner to allow for multi-process execution.  This is being
> >>>>>> tracked here (https://issues.apache.org/jira/browse/BEAM-3645).
> >>>>>>
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Charles
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Got feedback? go/pabloem-feedback
> >>
> >>
>

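For anyone who wants to try the runner named in Charles's original message before it becomes the default, here is a minimal sketch of passing it to a pipeline explicitly. It assumes an apache_beam install in which the import path quoted in the thread resolves; the helper names extract_words and run_wordcount are made up for illustration, and this is a toy word count, not the WordCount example that was benchmarked.

```python
import re


def extract_words(line):
    """Split a line into words (illustrative tokenizer, hypothetical helper)."""
    return re.findall(r"[A-Za-z']+", line)


def run_wordcount(lines):
    """Count words in `lines` on the FnApiRunner and print (word, count) pairs."""
    # Imported here so extract_words stays usable without Beam installed.
    import apache_beam as beam
    # Import path taken from the thread; assumed to exist in your Beam version.
    from apache_beam.runners.portability.fn_api_runner import FnApiRunner

    # Passing a runner instance overrides the default local runner.
    with beam.Pipeline(runner=FnApiRunner()) as p:
        (p
         | "Create" >> beam.Create(lines)
         | "Split" >> beam.FlatMap(extract_words)
         | "Count" >> beam.combiners.Count.PerElement()
         | "Print" >> beam.Map(print))
```

Calling run_wordcount(["to be or not to be"]) should print one (word, count) pair per distinct word, in no particular order. Since only the runner argument differs from a default pipeline, the rest of the pipeline code stays the same, in line with the "no user pipeline changes" point above.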