[
https://issues.apache.org/jira/browse/BEAM-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631012#comment-16631012
]
Kyle Winkelman commented on BEAM-5519:
--------------------------------------
Current:
// GroupCombineFunctions.groupByKeyOnly
JavaRDD<WindowedValue<KV<K, V>>>
JavaRDD<WindowedValue<KV<K, WindowedValue<V>>>>
JavaRDD<KV<K, WindowedValue<V>>>
JavaPairRDD<K, WindowedValue<V>>
JavaPairRDD<ByteArray, byte[]>
JavaPairRDD<ByteArray, Iterable<byte[]>> // groupByKey
JavaPairRDD<K, Iterable<WindowedValue<V>>>
JavaRDD<KV<K, Iterable<WindowedValue<V>>>>
JavaRDD<WindowedValue<KV<K, Iterable<WindowedValue<V>>>>>
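The chain above can be sketched with a toy stand-in (plain Java collections instead of RDDs; the String "coder" and helper names are illustrative, not Beam's API). It mirrors the pattern of serializing keys and values to bytes for the shuffle, grouping on the encoded key, then deserializing immediately afterwards — the first encode/decode cycle:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

public class GroupByKeyOnlySketch {
    // Assumed toy "coder": encode/decode a String as UTF-8 bytes.
    static byte[] encode(String s) { return s.getBytes(StandardCharsets.UTF_8); }
    static String decode(byte[] b) { return new String(b, StandardCharsets.UTF_8); }

    // Group encoded pairs by encoded key, mirroring
    // JavaPairRDD<ByteArray, byte[]> -> JavaPairRDD<ByteArray, Iterable<byte[]>>.
    // (Spark's ByteArray wrapper makes byte[] usable as a key; here we
    // base64-encode the key bytes into a String instead.)
    static Map<String, List<String>> groupByKeyOnly(List<Map.Entry<String, String>> pairs) {
        Map<String, List<byte[]>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, String> kv : pairs) {
            // first encode cycle: serialize key and value before the shuffle
            String keyBytes = Base64.getEncoder().encodeToString(encode(kv.getKey()));
            grouped.computeIfAbsent(keyBytes, k -> new ArrayList<>()).add(encode(kv.getValue()));
        }
        // first decode cycle: deserialize right after the shuffle
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<byte[]>> e : grouped.entrySet()) {
            String key = decode(Base64.getDecoder().decode(e.getKey()));
            List<String> values = new ArrayList<>();
            for (byte[] v : e.getValue()) values.add(decode(v));
            result.put(key, values);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> input = List.of(
            Map.entry("a", "1"), Map.entry("b", "2"), Map.entry("a", "3"));
        System.out.println(groupByKeyOnly(input));
    }
}
```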
// SparkGroupAlsoByWindowViaWindowSet.buildPairDStream
JavaRDD<KV<K, Iterable<WindowedValue<V>>>>
JavaPairRDD<K, Iterable<WindowedValue<V>>>
JavaPairRDD<K, KV<Long, Iterable<WindowedValue<V>>>>
JavaPairRDD<ByteArray, byte[]>
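The last step above is the second encode cycle: the values that were just decoded after the shuffle are paired with a timestamp and serialized to bytes again so they can be carried as JavaPairRDD<ByteArray, byte[]> state. A minimal sketch of such a round trip (the wire layout and method names are assumptions for illustration, not Beam's actual coder):

```java
import java.io.*;
import java.util.*;

public class PairDStreamSketch {
    // Assumed toy coder for KV<Long, Iterable<String>>:
    // [timestamp][value count][values as UTF strings].
    static byte[] encodeState(long timestamp, List<String> values) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeLong(timestamp);
            out.writeInt(values.size());
            for (String v : values) out.writeUTF(v);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Map.Entry<Long, List<String>> decodeState(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            long timestamp = in.readLong();
            int n = in.readInt();
            List<String> values = new ArrayList<>();
            for (int i = 0; i < n; i++) values.add(in.readUTF());
            return Map.entry(timestamp, values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // second encode cycle: these values were already decoded once after the
        // shuffle, and are now serialized again for the stateful DStream
        byte[] state = encodeState(42L, List.of("1", "3"));
        System.out.println(decodeState(state));
    }
}
```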
// UpdateStateByKeyOutputIterator.computeNext
receives the scala.collection.Seq<byte[]>, the seq of values that share the same key
decodes it to scala.collection.Seq<KV<Long, Iterable<WindowedValue<V>>>> (zero or
one items, because we have already grouped by key)
takes the head of the Seq and pulls out the Iterable<WindowedValue<V>>
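The unwrapping described above can be sketched with plain Java collections standing in for the Scala Seq (names are illustrative, not the real UpdateStateByKeyOutputIterator code):

```java
import java.util.*;

public class ComputeNextSketch {
    // The decoded state Seq holds zero or one entries per key (we already
    // grouped by key), so take the head if present and pull out the values.
    static List<String> headValues(List<Map.Entry<Long, List<String>>> seq) {
        if (seq.isEmpty()) {
            return List.of();            // no state for this key yet
        }
        return seq.get(0).getValue();    // head of the Seq -> grouped values
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, List<String>>> seq =
            List.of(Map.entry(42L, List.of("1", "3")));
        System.out.println(headValues(seq));
    }
}
```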
> Spark Streaming Duplicated Encoding/Decoding Effort
> ---------------------------------------------------
>
> Key: BEAM-5519
> URL: https://issues.apache.org/jira/browse/BEAM-5519
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Reporter: Kyle Winkelman
> Assignee: Kyle Winkelman
> Priority: Major
> Labels: spark, spark-streaming
>
> When using the SparkRunner in streaming mode, there is a call to groupByKey
> followed by a call to updateStateByKey. BEAM-1815 fixed an issue where this
> used to cause 2 shuffles, but it still causes 2 encode/decode cycles.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)