[ 
https://issues.apache.org/jira/browse/BEAM-10670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332707#comment-17332707
 ] 

Ismaël Mejía commented on BEAM-10670:
-------------------------------------

I can confirm a performance degradation of 15-20% when using the now 
'default' execution option on the Spark runner. We tested this with TPC-DS 
query 3 on a 1000 GB input dataset, reading CSV input via TextIO.

What is puzzling is that I also see a performance degradation when the inputs 
are not based on the traditional Read transform. For example, 
ParquetIO.withSplit (based on SDF) performs worse by default than when 
configured with `--experiments=use_deprecated_read`. Something odd is going on 
here. Do you think we can get someone to look deeper into this, maybe 
[~boyuanz]? Otherwise, it is probably best that we opt out of this for the 
next release until its performance is better.
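
For anyone who wants to compare both modes on their own pipeline, here is a 
minimal sketch of a launcher-side helper that makes sure the opt-out 
experiment is present in the pipeline arguments. The class and method names 
(`DeprecatedReadArgs`, `withDeprecatedRead`) are hypothetical and not part of 
Beam. It also assumes that several experiments can be passed as one 
comma-separated `--experiments=` value. The resulting array would then be 
handed to `PipelineOptionsFactory.fromArgs(...)` as usual.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Beam): adds the opt-out experiment to launcher args. */
public class DeprecatedReadArgs {
  static final String FLAG = "--experiments=";
  static final String EXPERIMENT = "use_deprecated_read";

  /** Returns a copy of args that is guaranteed to request use_deprecated_read. */
  public static String[] withDeprecatedRead(String[] args) {
    List<String> out = new ArrayList<>(Arrays.asList(args));
    for (int i = 0; i < out.size(); i++) {
      String arg = out.get(i);
      if (arg.startsWith(FLAG)) {
        String current = arg.substring(FLAG.length());
        if (current.isEmpty()) {
          // Flag present but empty: set the experiment as its only value.
          out.set(i, FLAG + EXPERIMENT);
        } else if (!Arrays.asList(current.split(",")).contains(EXPERIMENT)) {
          // Assumption: experiments are accepted as a comma-separated list.
          out.set(i, arg + "," + EXPERIMENT);
        }
        return out.toArray(new String[0]);
      }
    }
    // No --experiments flag was given, so append one.
    out.add(FLAG + EXPERIMENT);
    return out.toArray(new String[0]);
  }

  public static void main(String[] args) {
    // Prints the adjusted argument list for inspection.
    System.out.println(String.join(" ", withDeprecatedRead(args)));
  }
}
```

Running the same job once with the original arguments and once with the 
adjusted ones should give a like-for-like comparison of the two read paths.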

CC [~kenn]

> Make non-portable Splittable DoFn the only option when executing Java "Read" 
> transforms
> ---------------------------------------------------------------------------------------
>
>                 Key: BEAM-10670
>                 URL: https://issues.apache.org/jira/browse/BEAM-10670
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core
>            Reporter: Luke Cwik
>            Priority: P3
>              Labels: Clarified
>          Time Spent: 37h 50m
>  Remaining Estimate: 0h
>
> All runners seem to be capable of migrating to splittable DoFn for 
> non-portable execution except for Dataflow runner v1 which will internalize 
> the current primitive read implementation that is shared across runner 
> implementations.



