[jira] [Commented] (BEAM-5519) Spark Streaming Duplicated Encoding/Decoding Effort
[ https://issues.apache.org/jira/browse/BEAM-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634849#comment-16634849 ]

Amit Sela commented on BEAM-5519:
---------------------------------

[~winkelman.kyle] you bring up a good point. IIRC we did all this mess to guarantee that when Spark requires a shuffle, we control it (initiate it), and that it applies only to RDDs containing serialized data. This might have been _before_ we got the "force default partitioner" in place, or a mixed process... not sure.

You can test your change (in streaming, which uses {{UpdateStateByKey}} and so creates shuffles, etc.) to make sure that no shuffle occurs on RDDs containing deserialized data. In addition, you can use a non-Kryo-serializable element (if the runner still defaults to Kryo underneath...) and make sure it doesn't fail.

Hope that helps!

> Spark Streaming Duplicated Encoding/Decoding Effort
> ---------------------------------------------------
>
>                 Key: BEAM-5519
>                 URL: https://issues.apache.org/jira/browse/BEAM-5519
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Kyle Winkelman
>            Assignee: Kyle Winkelman
>            Priority: Major
>              Labels: spark, spark-streaming
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> When using the SparkRunner in streaming mode, there is a call to groupByKey
> followed by a call to updateStateByKey. BEAM-1815 fixed an issue where this
> used to cause 2 shuffles, but it still causes 2 encode/decode cycles.
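A minimal sketch of the check suggested above, assuming the pipeline is run with the SparkRunner in streaming mode: the element type is deliberately not Java-serializable (and unlikely to survive Kryo), but it has a Beam {{Coder}}, so the job should only succeed if the runner never hands the deserialized objects to Spark's own serializer during a shuffle. {{NonSerializableValue}}, its coder, and the use of a bounded {{Create}} source are illustrative choices for the sketch, not runner code.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;

public class NonSerializableShuffleCheck {

  /** A value that Spark's serializers would choke on (it holds a live Thread). */
  static class NonSerializableValue {
    final String payload;
    final Thread unserializableField = new Thread();

    NonSerializableValue(String payload) {
      this.payload = payload;
    }
  }

  /** Beam knows how to encode the value even though Spark does not. */
  static class NonSerializableValueCoder extends CustomCoder<NonSerializableValue> {
    @Override
    public void encode(NonSerializableValue value, OutputStream out) throws IOException {
      StringUtf8Coder.of().encode(value.payload, out);
    }

    @Override
    public NonSerializableValue decode(InputStream in) throws IOException {
      return new NonSerializableValue(StringUtf8Coder.of().decode(in));
    }
  }

  public static void main(String[] args) {
    SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    options.setStreaming(true); // exercise the UpdateStateByKey path

    Pipeline p = Pipeline.create(options);
    p.apply(Create.of(
            KV.of("k", new NonSerializableValue("a")),
            KV.of("k", new NonSerializableValue("b")))
        .withCoder(KvCoder.of(StringUtf8Coder.of(), new NonSerializableValueCoder())))
        .apply(GroupByKey.<String, NonSerializableValue>create());

    p.run().waitUntilFinish();
  }
}
{code}

If a change accidentally lets Spark shuffle or checkpoint the deserialized {{NonSerializableValue}} objects, the job fails with a serialization error; if everything stays in encoded form across shuffles, it completes.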
[jira] [Commented] (BEAM-5519) Spark Streaming Duplicated Encoding/Decoding Effort
[ https://issues.apache.org/jira/browse/BEAM-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631017#comment-16631017 ]

Kyle Winkelman commented on BEAM-5519:
--------------------------------------

Proposed:

// SparkGroupAlsoByWindowViaWindowSet.buildPairDStream
JavaRDD>>
JavaRDD>>>
JavaRDD>>
JavaPairRDD

// UpdateStateByKeyOutputIterator.computeNext
gets the scala.collection.Seq (the seq of values that have the same key)
decoded to scala.collection.Seq> (convert to Iterable)
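A sketch of what the proposed flow above amounts to, assuming the usual pattern of encoding with the element's {{Coder}} before handing data to Spark: the key and the windowed value are turned into bytes exactly once on the way into {{updateStateByKey}}, and only {{UpdateStateByKeyOutputIterator}} decodes them again. The class and method names below ({{ProposedPairing}}, {{toEncodedPairs}}) are illustrative, not the runner's code.

{code:java}
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.beam.sdk.values.KV;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class ProposedPairing {

  /** Encode each element once: key bytes for partitioning, value bytes for the state RDD. */
  static <K, V> JavaPairRDD<byte[], byte[]> toEncodedPairs(
      JavaRDD<WindowedValue<KV<K, V>>> input,
      Coder<K> keyCoder,
      Coder<WindowedValue<V>> windowedValueCoder) {
    return input.mapToPair(
        wv -> new Tuple2<>(
            CoderUtils.encodeToByteArray(keyCoder, wv.getValue().getKey()),
            CoderUtils.encodeToByteArray(windowedValueCoder, wv.withValue(wv.getValue().getValue()))));
  }
}
{code}

In the actual runner the key bytes would still need value-based equality for Spark's shuffle, which is why raw {{byte[]}} keys get wrapped before partitioning; the point here is only that one encode on the way in and one decode in {{computeNext}} would replace the two cycles walked through in the earlier "Current" comment below.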
[jira] [Commented] (BEAM-5519) Spark Streaming Duplicated Encoding/Decoding Effort
[ https://issues.apache.org/jira/browse/BEAM-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631012#comment-16631012 ]

Kyle Winkelman commented on BEAM-5519:
--------------------------------------

Current:

// GroupCombineFunctions.groupByKeyOnly
JavaRDD>>
JavaRDD>>>
JavaRDD>>
JavaPairRDD>
JavaPairRDD
JavaPairRDD>

// groupByKey
JavaPairRDD>>
JavaRDD>>>
JavaRDD

// SparkGroupAlsoByWindowViaWindowSet.buildPairDStream
JavaRDD>>>
JavaPairRDD>>
JavaPairRDD>>>
JavaPairRDD

// UpdateStateByKeyOutputIterator.computeNext
gets the scala.collection.Seq (the seq of values that have the same key)
decoded to scala.collection.Seq>>> (zero or one items, because we have already grouped by key)
get the head of the Seq and pull out the Iterable>
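A compact illustration of the duplicated effort described above, using plain Beam coders on a toy value. The two round trips mirror {{groupByKeyOnly}}/{{groupByKey}} (first encode and decode) and {{buildPairDStream}}/{{UpdateStateByKeyOutputIterator.computeNext}} (second encode and decode); all names in the sketch are illustrative, not runner code.

{code:java}
import java.util.Arrays;

import org.apache.beam.sdk.coders.IterableCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.util.CoderUtils;

public class DoubleRoundTrip {
  public static void main(String[] args) throws Exception {
    IterableCoder<String> valuesCoder = IterableCoder.of(StringUtf8Coder.of());

    // 1st encode: groupByKeyOnly turns values into byte[] so Spark can shuffle them.
    byte[] shuffled = CoderUtils.encodeToByteArray(valuesCoder, Arrays.asList("a", "b"));
    // 1st decode: the grouped values are materialized back into objects.
    Iterable<String> grouped = CoderUtils.decodeFromByteArray(valuesCoder, shuffled);

    // 2nd encode: buildPairDStream re-encodes the same values for updateStateByKey.
    byte[] state = CoderUtils.encodeToByteArray(valuesCoder, grouped);
    // 2nd decode: UpdateStateByKeyOutputIterator.computeNext decodes them once more.
    Iterable<String> emitted = CoderUtils.decodeFromByteArray(valuesCoder, state);

    System.out.println(emitted); // same values, encoded and decoded twice
  }
}
{code}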