[
https://issues.apache.org/jira/browse/BEAM-10670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332707#comment-17332707
]
Ismaël Mejía commented on BEAM-10670:
-------------------------------------
I can confirm a performance degradation of 15-20% when using the now
'default' execution option on the Spark runner. We tested this with TPC-DS query
3 on a 1000GB input dataset, reading CSV input via TextIO.
Something puzzling is that I can see a performance degradation even when the
inputs are not based on the traditional Read transform. For example,
ParquetIO.withSplit (based on SDF) performs worse by default than when
configured with `--experiments=use_deprecated_read`. Something odd is going on
here. [~boyuanz], do you think we can get someone to dig deeper into this?
Otherwise it is probably best that we opt out of this for the next release
until its performance is better.
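For reference, a minimal sketch of how the opt-out is passed as pipeline arguments. Only the `--experiments` flag comes from the comparison above; `--runner=SparkRunner` is an assumed way of selecting the runner, and all pipeline-specific arguments are omitted:

```
--runner=SparkRunner --experiments=use_deprecated_read
```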
CC [~kenn]
> Make non-portable Splittable DoFn the only option when executing Java "Read"
> transforms
> ---------------------------------------------------------------------------------------
>
> Key: BEAM-10670
> URL: https://issues.apache.org/jira/browse/BEAM-10670
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core
> Reporter: Luke Cwik
> Priority: P3
> Labels: Clarified
> Time Spent: 37h 50m
> Remaining Estimate: 0h
>
> All runners seem to be capable of migrating to splittable DoFn for
> non-portable execution except for Dataflow runner v1 which will internalize
> the current primitive read implementation that is shared across runner
> implementations.