Very cool Charles! Have you considered whether you'll want to remove the direct runner code afterwards? Best -P.
On Wed, Feb 7, 2018, 6:25 PM Lukasz Cwik <lc...@google.com> wrote: > That is pretty awesome. > > On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote: > >> Local execution of Beam pipelines on the Python DirectRunner currently >> suffers from performance issues, which makes it hard for pipeline authors >> to iterate, especially on medium to large size datasets. We would like to >> optimize and make this a better experience for Beam users. >> >> The FnApiRunner was written as a way of leveraging the portability >> framework execution code path for local portability development. We've >> found it also provides great speedups in batch execution with no user >> changes required, so we propose to switch to use this runner by default in >> batch pipelines. For example, WordCount on the Shakespeare dataset with a >> single CPU core now takes 50 seconds to run, compared to 12 minutes before; >> this is a 15x performance improvement that users can get for free, with >> no user pipeline changes. >> >> The JIRA for this change is here ( >> https://issues.apache.org/jira/browse/BEAM-3644), and a candidate patch >> is available here (https://github.com/apache/beam/pull/4634). I have >> been working over the last month on making this an automatic drop-in >> replacement for the current DirectRunner when applicable. Before it >> becomes the default, you can try this runner now by manually specifying >> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the runner. >> >> Even with this change, local Python pipeline execution can only >> effectively use one core because of the Python GIL. A natural next step to >> further improve performance will be to refactor the FnApiRunner to allow >> for multi-process execution. This is being tracked here ( >> https://issues.apache.org/jira/browse/BEAM-3645). >> >> Best, >> >> Charles >> > > -- Got feedback? go/pabloem-feedback