[ 
https://issues.apache.org/jira/browse/BEAM-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-3644:
----------------------------------
    Description: 
Local execution of Beam pipelines on the Python DirectRunner currently 
suffers from performance issues, which makes it hard for pipeline authors to 
iterate, especially on medium to large datasets. We would like to optimize 
local execution and make this a better experience for Beam users.

The FnApiRunner was written to exercise the portability framework's 
execution code path locally, as an aid to portability development. We've 
found that it also offers large speedups in batch execution, so we propose 
switching to this runner for batch pipelines. For example, WordCount on the 
Shakespeare dataset with a single CPU core now takes 50 seconds to run, 
compared to 12 minutes before: a roughly 15x performance improvement that 
users get for free, with no pipeline changes.

  was:
Local execution of Beam pipelines on the Python DirectRunner currently 
suffers from performance issues, which makes it hard for pipeline authors to 
iterate, especially on medium to large datasets. We would like to optimize 
local execution and make this a better experience for Beam users.

In the past few months, Robert implemented the FnApiRunner as a way of 
leveraging the portability framework execution code path for local execution. 
We've found large speedups in batch execution, so we propose switching to 
this runner for batch pipelines. For example, WordCount on the Shakespeare 
dataset with a single CPU core now takes 50 seconds to run, compared to 12 
minutes before: a roughly 15x performance improvement that users get for 
free, with no pipeline changes.


> Speed up Python DirectRunner execution by using the FnApiRunner when possible
> -----------------------------------------------------------------------------
>
>                 Key: BEAM-3644
>                 URL: https://issues.apache.org/jira/browse/BEAM-3644
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>    Affects Versions: 2.2.0, 2.3.0
>            Reporter: Charles Chen
>            Assignee: Charles Chen
>            Priority: Major
>
> Local execution of Beam pipelines on the Python DirectRunner currently 
> suffers from performance issues, which makes it hard for pipeline authors 
> to iterate, especially on medium to large datasets. We would like to 
> optimize local execution and make this a better experience for Beam users.
>
> The FnApiRunner was written to exercise the portability framework's 
> execution code path locally, as an aid to portability development. We've 
> found that it also offers large speedups in batch execution, so we propose 
> switching to this runner for batch pipelines. For example, WordCount on the 
> Shakespeare dataset with a single CPU core now takes 50 seconds to run, 
> compared to 12 minutes before: a roughly 15x performance improvement that 
> users get for free, with no pipeline changes.
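As a quick sanity check on the speedup quoted in the description, the 
arithmetic can be sketched as below. The helper name speedup() is 
hypothetical and chosen only for illustration; the two timings are the ones 
given in the issue (12 minutes before, 50 seconds after), and the sketch 
suggests the quoted 15x is a rounded figure.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the new run is than the old one."""
    if after_seconds <= 0:
        raise ValueError("after_seconds must be positive")
    return before_seconds / after_seconds


# Timings quoted in the issue description.
before = 12 * 60  # 12 minutes on the old DirectRunner path, in seconds
after = 50        # 50 seconds with the FnApiRunner

factor = speedup(before, after)
print(f"{factor:.1f}x")  # prints "14.4x"
```

The exact ratio is 720 s / 50 s = 14.4, which is in line with the rounded 
figure in the description given that both timings are themselves approximate.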



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
