[ https://issues.apache.org/jira/browse/BEAM-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631012#comment-16631012 ]
Kyle Winkelman commented on BEAM-5519:
--------------------------------------

Current:

// GroupCombineFunctions.groupByKeyOnly
JavaRDD<WindowedValue<KV<K, V>>>
JavaRDD<WindowedValue<KV<K, WindowedValue<V>>>>
JavaRDD<KV<K, WindowedValue<V>>>
JavaPairRDD<K, WindowedValue<V>>
JavaPairRDD<ByteArray, byte[]>
JavaPairRDD<ByteArray, Iterable<byte[]>>

// groupByKey
JavaPairRDD<K, Iterable<WindowedValue<V>>>
JavaRDD<KV<K, Iterable<WindowedValue<V>>>>
JavaRDD<WindowedValue<KV<K, Iterable<WindowedValue<V>>>>>

// SparkGroupAlsoByWindowViaWindowSet.buildPairDStream
JavaRDD<KV<K, Iterable<WindowedValue<V>>>>
JavaPairRDD<K, Iterable<WindowedValue<V>>>
JavaPairRDD<K, KV<Long, Iterable<WindowedValue<V>>>>
JavaPairRDD<ByteArray, byte[]>

// UpdateStateByKeyOutputIterator.computeNext
Gets the scala.collection.Seq<byte[]> (the sequence of values that share the same key), decodes it to scala.collection.Seq<KV<Long, Iterable<WindowedValue<V>>>> (zero or one items, because we have already grouped by key), then takes the head of the Seq and pulls out the Iterable<WindowedValue<V>>.

> Spark Streaming Duplicated Encoding/Decoding Effort
> ---------------------------------------------------
>
>                 Key: BEAM-5519
>                 URL: https://issues.apache.org/jira/browse/BEAM-5519
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Kyle Winkelman
>            Assignee: Kyle Winkelman
>            Priority: Major
>              Labels: spark, spark-streaming
>
> When using the SparkRunner in streaming mode, there is a call to groupByKey followed by a call to updateStateByKey. BEAM-1815 fixed an issue where this used to cause 2 shuffles, but it still causes 2 encode/decode cycles.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
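The duplicated effort traced above can be sketched in miniature: each boundary (groupByKeyOnly's shuffle, then buildPairDStream's updateStateByKey) serializes elements to byte[] and deserializes them on the other side, so every value is coded twice. This is a minimal, self-contained illustration using plain JDK strings and hypothetical encode/decode helpers, not Beam's actual CoderHelpers or Coder API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class DuplicatedCoding {
    // Counters to make the duplicated work visible.
    static int encodeCalls = 0;
    static int decodeCalls = 0;

    // Hypothetical stand-ins for a Beam Coder<V>'s encode/decode.
    static byte[] encode(String value) {
        encodeCalls++;
        return value.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        decodeCalls++;
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a", "b", "c");

        // First cycle: groupByKeyOnly encodes to JavaPairRDD<ByteArray, byte[]>
        // for the shuffle, then decodes back to Iterable<WindowedValue<V>>.
        List<byte[]> shuffle1 = values.stream()
                .map(DuplicatedCoding::encode).collect(Collectors.toList());
        List<String> grouped = shuffle1.stream()
                .map(DuplicatedCoding::decode).collect(Collectors.toList());

        // Second cycle: buildPairDStream encodes the same values again to
        // JavaPairRDD<ByteArray, byte[]> for updateStateByKey, and
        // UpdateStateByKeyOutputIterator.computeNext decodes them once more.
        List<byte[]> shuffle2 = grouped.stream()
                .map(DuplicatedCoding::encode).collect(Collectors.toList());
        List<String> state = shuffle2.stream()
                .map(DuplicatedCoding::decode).collect(Collectors.toList());

        // Three elements, but six encodes and six decodes.
        System.out.println("encode calls: " + encodeCalls);
        System.out.println("decode calls: " + decodeCalls);
    }
}
```

Fixing the issue would mean keeping the bytes produced by the first shuffle and feeding them directly into the stateful stage, so each element is encoded and decoded only once.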