[jira] [Commented] (BEAM-2516) User reports 4 minutes to process 1 million line CSV in DirectRunner

Kenneth Knowles (JIRA) Wed, 30 Aug 2017 21:09:28 -0700

    [ https://issues.apache.org/jira/browse/BEAM-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148418#comment-16148418 ]


Kenneth Knowles commented on BEAM-2516:
---------------------------------------

I think for 2.2.0 it is best to take the translation to/from a proto off the 
default path by hiding it behind PipelineOptions.

There's a lot of overhead right now because of the impedance mismatch between 
the parts that are still Java-specific and the parts that are SDK-agnostic. In 
the full story for the portability framework, the DoFns and other UDFs can't 
even be deserialized by the runner; they are shipped to the SDK harness 
instead. The harness will own the caching, so it probably doesn't make sense to 
add caching to the DirectRunner unless there's one silly repeated 
deserialization we can eliminate. Based on the profiling results, perhaps there 
is, but there is no need to block anything on it.
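
To make "eliminate one repeated deserialization" concrete, here is a minimal 
sketch in plain Java. It is not Beam code: DeserializationCache and its methods 
are hypothetical names made up for illustration, and it assumes the payload is 
standard Java serialization and that the deserialized object can safely be 
shared (which is not true in general for a DoFn with mutable state).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of Beam): memoizes deserialization by payload content. */
final class DeserializationCache {
  // ByteBuffer has content-based equals/hashCode, so equal payloads share one entry.
  private final Map<ByteBuffer, Object> cache = new ConcurrentHashMap<>();
  private final AtomicInteger deserializations = new AtomicInteger();

  /** Serializes a value with standard Java serialization. */
  static byte[] serialize(Serializable value) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(value);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Returns the cached object for this payload, deserializing only on first sight. */
  Object get(byte[] payload) {
    // Copy the bytes so later mutation of the caller's array cannot corrupt the key.
    ByteBuffer key = ByteBuffer.wrap(payload.clone());
    return cache.computeIfAbsent(key, k -> deserialize(payload));
  }

  /** Number of times a payload was actually deserialized (for profiling or tests). */
  int deserializationCount() {
    return deserializations.get();
  }

  private Object deserialize(byte[] payload) {
    deserializations.incrementAndGet();
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
      return in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException("Class of serialized object not found", e);
    }
  }
}
```

Calling get twice with equal payload bytes returns the same instance and 
deserializes only once. Whether anything like this is worth adding depends on 
what the profile shows, since the SDK harness is expected to own caching in the 
portable design.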

> User reports 4 minutes to process 1 million line CSV in DirectRunner
> --------------------------------------------------------------------
>
>                 Key: BEAM-2516
>                 URL: https://issues.apache.org/jira/browse/BEAM-2516
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-direct
>            Reporter: Kenneth Knowles
>            Priority: Minor
>             Fix For: 2.2.0
>
>
> https://stackoverflow.com/questions/44736414/simple-apache-beam-manipulations-work-very-slow
> I don't know what the expectations are here, so I wasn't ready to say this is 
> WAI. Low priority since it isn't what the runner is for anyhow, but this 
> seems like the scale of data that should be snappy. Worth investigating, or 
> maybe you can quickly indicate why it is expected?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
