Charles Chen created BEAM-3644:
----------------------------------
Summary: Speeding up Python DirectRunner execution by using the
FnApiRunner when possible
Key: BEAM-3644
URL: https://issues.apache.org/jira/browse/BEAM-3644
Project: Beam
Issue Type: Improvement
Components: sdk-py-core
Affects Versions: 2.2.0, 2.3.0
Reporter: Charles Chen
Assignee: Charles Chen
Local execution of Beam pipelines on the Python DirectRunner currently
suffers from performance issues, which make it hard for pipeline authors to
iterate, especially on medium to large datasets. We would like to optimize
local execution and make this a better experience for Beam users.
In the past few months, Robert implemented the FnApiRunner as a way of
leveraging the portability framework execution code path for local execution.
We've found large speedups in batch execution, so we propose switching the
DirectRunner to use this runner for batch pipelines when possible. For example,
WordCount on the Shakespeare dataset with a single CPU core now takes 50 seconds
to run, compared to 12 minutes before, a roughly 14x performance improvement
that users can get for free, with no pipeline changes.
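
Until the DirectRunner delegates automatically, a pipeline author can try the
faster path by passing the FnApiRunner explicitly. The following is a minimal
sketch rather than the change proposed in this issue: the helper names
split_words and run_wordcount are hypothetical, the tiny in-memory input stands
in for the Shakespeare dataset, and the import path for FnApiRunner is an
assumption that may differ between Beam releases.

```python
import re


def split_words(line):
    """Return the words in a line of text (hypothetical helper)."""
    return re.findall(r"[A-Za-z']+", line)


def run_wordcount(lines):
    """Count words in `lines` on the FnApiRunner and print each pair.

    Beam is imported here, not at module level, so that split_words
    stays usable on machines without Beam installed.
    """
    import apache_beam as beam
    # Assumption: this import path matches the installed Beam release.
    from apache_beam.runners.portability.fn_api_runner import FnApiRunner

    # Pass the runner explicitly instead of relying on the default.
    with beam.Pipeline(runner=FnApiRunner()) as p:
        _ = (
            p
            | "Create" >> beam.Create(lines)
            | "Split" >> beam.FlatMap(split_words)
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )
```

With Beam installed, calling run_wordcount(["to be or not to be"]) prints one
(word, count) pair per distinct word. The pipeline itself is unchanged from
what would run on the DirectRunner; only the runner argument differs.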