That is pretty awesome.

On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:
> Local execution of Beam pipelines on the Python DirectRunner currently
> suffers from performance issues, which makes it hard for pipeline
> authors to iterate, especially on medium to large datasets. We would
> like to optimize this and make it a better experience for Beam users.
>
> The FnApiRunner was written as a way of leveraging the portability
> framework execution code path for local portability development. We've
> found that it also provides great speedups in batch execution, so we
> propose to use this runner by default for batch pipelines. For example,
> WordCount on the Shakespeare dataset with a single CPU core now takes
> 50 seconds to run, compared to 12 minutes before; this is roughly a 14x
> performance improvement that users get for free, with no pipeline
> changes.
>
> The JIRA for this change is here
> (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate
> patch is available here (https://github.com/apache/beam/pull/4634). I
> have been working over the last month on making this an automatic
> drop-in replacement for the current DirectRunner when applicable.
> Before it becomes the default, you can try this runner now by manually
> specifying
> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the
> runner.
>
> Even with this change, local Python pipeline execution can effectively
> use only one core because of the Python GIL. A natural next step to
> further improve performance will be to refactor the FnApiRunner to
> allow for multi-process execution. This is being tracked here
> (https://issues.apache.org/jira/browse/BEAM-3645).
>
> Best,
>
> Charles
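
For anyone who wants to try the runner before it becomes the default, here is a minimal sketch of what "manually specifying" it might look like. The fully qualified runner name is taken from the message above; the helper names (pipeline_args, run_word_lengths) and the toy pipeline are hypothetical, and passing a dotted class path through the --runner option is my assumption about how the SDK resolves runner names, so check it against your Beam version.

```python
# Fully qualified runner name, copied from the quoted message.
FN_API_RUNNER = "apache_beam.runners.portability.fn_api_runner.FnApiRunner"


def pipeline_args(extra_args=None):
    """Build command-line style pipeline arguments that select the FnApiRunner.

    Hypothetical helper: assumes --runner accepts a dotted class path.
    """
    return ["--runner=" + FN_API_RUNNER] + list(extra_args or [])


def run_word_lengths(words):
    """Run a tiny batch pipeline with the runner selected above.

    Hypothetical example pipeline; requires the apache_beam package.
    """
    # Imported lazily so pipeline_args() stays usable without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args())
    with beam.Pipeline(options=options) as p:
        _ = (
            p
            | "Create" >> beam.Create(words)  # small in-memory input
            | "Length" >> beam.Map(len)       # one length per word
            | "Print" >> beam.Map(print)      # print each result
        )
```

Calling run_word_lengths(["to", "be", "or", "not"]) should execute the pipeline locally. Nothing else in the pipeline code changes, which is the point of the proposal: only the runner selection differs.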