This is going to be a great improvement for our users! I'll take a look at the pull request.

On Wed, Feb 7, 2018 at 7:03 PM, Kenneth Knowles <[email protected]> wrote:
> Nice!
>
> On Wed, Feb 7, 2018 at 6:45 PM, Charles Chen <[email protected]> wrote:
>>
>> The existing DirectRunner will be needed for the foreseeable future since
>> it is currently the only local runner that supports streaming execution.
>>
>>
>> On Wed, Feb 7, 2018, 6:39 PM Pablo Estrada <[email protected]> wrote:
>>>
>>> Very cool Charles! Have you considered whether you'll want to remove the
>>> direct runner code afterwards?
>>> Best
>>> -P.
>>>
>>>
>>> On Wed, Feb 7, 2018, 6:25 PM Lukasz Cwik <[email protected]> wrote:
>>>>
>>>> That is pretty awesome.
>>>>
>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <[email protected]> wrote:
>>>>>
>>>>> Local execution of Beam pipelines on the Python DirectRunner currently
>>>>> suffers from performance issues, which makes it hard for pipeline
>>>>> authors to iterate, especially on medium to large size datasets. We
>>>>> would like to optimize and make this a better experience for Beam users.
>>>>>
>>>>>
>>>>> The FnApiRunner was written as a way of leveraging the portability
>>>>> framework execution code path for local portability development. We've
>>>>> found it also provides great speedups in batch execution with no user
>>>>> changes required, so we propose to switch to use this runner by default
>>>>> in batch pipelines. For example, WordCount on the Shakespeare dataset
>>>>> with a single CPU core now takes 50 seconds to run, compared to 12
>>>>> minutes before; this is a 15x performance improvement that users can
>>>>> get for free, with no user pipeline changes.
>>>>>
>>>>>
>>>>> The JIRA for this change is here
>>>>> (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate patch
>>>>> is available here (https://github.com/apache/beam/pull/4634). I have
>>>>> been working over the last month on making this an automatic drop-in
>>>>> replacement for the current DirectRunner when applicable. Before it
>>>>> becomes the default, you can try this runner now by manually specifying
>>>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the runner.
>>>>>
>>>>>
>>>>> Even with this change, local Python pipeline execution can only
>>>>> effectively use one core because of the Python GIL. A natural next step
>>>>> to further improve performance will be to refactor the FnApiRunner to
>>>>> allow for multi-process execution. This is being tracked here
>>>>> (https://issues.apache.org/jira/browse/BEAM-3645).
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Charles
>>>>
>>>>
>>>
>>>
>>> --
>>> Got feedback? go/pabloem-feedback
>
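
For anyone who wants to try the runner from Charles's message before it becomes the default, here is a minimal sketch of specifying it manually. Only the fully qualified runner name comes from the thread; the count_words helper, its sample input, and the tiny word-count pipeline are hypothetical and for illustration only. It assumes that beam.Pipeline accepts a dotted runner name as a string and that apache_beam is installed.

```python
# Fully qualified runner name, copied from the quoted message.
FN_API_RUNNER = "apache_beam.runners.portability.fn_api_runner.FnApiRunner"


def count_words(lines, runner=FN_API_RUNNER):
    """Run a small batch word count over `lines` on the given runner.

    Hypothetical helper for illustration; it is not part of Beam.
    """
    # Imported lazily so this module still loads where Beam is not installed.
    import apache_beam as beam

    # Assumption: a dotted "module.ClassName" string is resolved to a runner class.
    with beam.Pipeline(runner=runner) as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(lines)
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)  # emits (word, count) pairs, in no fixed order
        )


if __name__ == "__main__":
    count_words(["to be or not to be", "that is the question"])
```

For an existing pipeline script, passing the same name through the --runner pipeline option should have the same effect, again assuming the option accepts a fully qualified class path.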
