[https://issues.apache.org/jira/browse/BEAM-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634849#comment-16634849]
Amit Sela commented on BEAM-5519:
---------------------------------
[~winkelman.kyle] you bring up a good point.
IIRC we did all this mess to guarantee that whenever a shuffle is required by
Spark, we control it (initiate it), and that it applies only to RDDs
containing serialized data.
This might have been _before_ we got the "force default partitioner" change in
place, or a mix of both.. not sure.
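Roughly, the idea was the following (a minimal sketch only, not the runner's
actual code path; the {{ByteArray}} wrapper and the {{shuffleSerialized}}
helper are illustrative stand-ins, while {{CoderUtils}} is Beam's real
utility):

{code:java}
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

class SerializedShuffleSketch {

  /** byte[] needs value-based equals/hashCode to be usable as a shuffle key. */
  static final class ByteArray implements java.io.Serializable {
    final byte[] value;
    ByteArray(byte[] value) { this.value = value; }
    @Override public boolean equals(Object o) {
      return o instanceof ByteArray && java.util.Arrays.equals(value, ((ByteArray) o).value);
    }
    @Override public int hashCode() { return java.util.Arrays.hashCode(value); }
  }

  static <K, V> JavaPairRDD<K, V> shuffleSerialized(
      JavaPairRDD<K, V> rdd, Coder<K> keyCoder, Coder<V> valueCoder, int partitions) {
    // Encode keys and values to bytes *before* the shuffle, so Spark only
    // ever moves serialized data across the wire.
    JavaPairRDD<ByteArray, byte[]> encoded = rdd.mapToPair(kv -> new Tuple2<>(
        new ByteArray(CoderUtils.encodeToByteArray(keyCoder, kv._1())),
        CoderUtils.encodeToByteArray(valueCoder, kv._2())));
    // We initiate the shuffle ourselves, on the serialized form.
    JavaPairRDD<ByteArray, byte[]> shuffled =
        encoded.partitionBy(new HashPartitioner(partitions));
    // Decode only after the shuffle boundary.
    return shuffled.mapToPair(kv -> new Tuple2<>(
        CoderUtils.decodeFromByteArray(keyCoder, kv._1().value),
        CoderUtils.decodeFromByteArray(valueCoder, kv._2())));
  }
}
{code}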
You can test your change (in streaming, which uses {{updateStateByKey}} and
so creates shuffles etc.) to make sure that no shuffle occurs on RDDs
containing deserialized data. In addition, you can use a non-Kryo-serializable
type (if the runner still defaults to Kryo underneath..) and make sure it
doesn't fail.
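The shape of such a check could be something like this (a sketch only,
assuming Beam's public API; {{AwkwardValue}} and the option wiring are
illustrative, and whether Kryo actually rejects the type depends on how the
serializer is configured):

{code:java}
import java.io.Serializable;
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

public class NonKryoValueCheck {

  /** Stand-in for a type Kryo may reject without registration (no no-arg ctor). */
  static final class AwkwardValue implements Serializable {
    final String payload;
    AwkwardValue(String payload) { this.payload = payload; }
  }

  public static void main(String[] args) {
    SparkPipelineOptions opts = PipelineOptionsFactory.as(SparkPipelineOptions.class);
    opts.setRunner(SparkRunner.class);
    opts.setStreaming(true); // exercise the updateStateByKey-based path

    Pipeline p = Pipeline.create(opts);
    p.apply(Create.of(KV.of("k", new AwkwardValue("v"))))
        .apply(GroupByKey.create());

    // If deserialized objects leaked across a Spark shuffle, Kryo would have
    // to handle AwkwardValue there; finishing cleanly supports the change.
    p.run().waitUntilFinish(Duration.standardSeconds(30));
  }
}
{code}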
Hope that helps!
> Spark Streaming Duplicated Encoding/Decoding Effort
> ---------------------------------------------------
>
> Key: BEAM-5519
> URL: https://issues.apache.org/jira/browse/BEAM-5519
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Reporter: Kyle Winkelman
> Assignee: Kyle Winkelman
> Priority: Major
> Labels: spark, spark-streaming
> Time Spent: 50m
> Remaining Estimate: 0h
>
> When using the SparkRunner in streaming mode, there is a call to
> {{groupByKey}} followed by a call to {{updateStateByKey}}. BEAM-1815 fixed
> an issue where this used to cause 2 shuffles, but it still causes 2
> encode/decode cycles.
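For illustration, the reported pattern looks roughly like this at the DStream
level (a sketch only, not the runner's actual translation; the byte[] values
stand in for coder-encoded elements):

{code:java}
import java.util.List;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

class DoubleCodecSketch {
  static JavaPairDStream<String, byte[]> groupThenState(
      JavaPairDStream<String, byte[]> keyed) {
    // Cycle 1: grouping decodes/re-encodes elements around the shuffle.
    JavaPairDStream<String, Iterable<byte[]>> grouped = keyed.groupByKey();

    // Cycle 2: the stateful step decodes the grouped bytes again to run the
    // user's logic, then re-encodes the resulting state.
    Function2<List<Iterable<byte[]>>, Optional<byte[]>, Optional<byte[]>> update =
        (newValues, oldState) ->
            Optional.of(oldState.isPresent() ? oldState.get() : new byte[0]);
    return grouped.updateStateByKey(update);
  }
}
{code}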