[jira] [Commented] (BEAM-5519) Spark Streaming Duplicated Encoding/Decoding Effort

2018-10-01 Thread Amit Sela (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634849#comment-16634849
 ] 

Amit Sela commented on BEAM-5519:
-

[~winkelman.kyle] you bring up a good point.
IIRC we did all this mess to guarantee that in case a shuffle is required by 
Spark, we control it (initiate it), and it applies to RDDs containing 
serialized data only.
This might have been _before_ we got the "force default partitioner" in place, 
or a mixed process.. not sure.
You can do test your change (in streaming, which uses {{UpdateStateByKey}} and 
so creates shuffles etc.) to make sure that no shuffle occurs on RDDs 
containing deserialized data. In addition, you can use a non-Kryo serializable 
(if the runner still defaults to Kryo underneath..) and make sure it doesn't 
fail.
Hope that helps!

> Spark Streaming Duplicated Encoding/Decoding Effort
> ---
>
> Key: BEAM-5519
> URL: https://issues.apache.org/jira/browse/BEAM-5519
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Kyle Winkelman
>Assignee: Kyle Winkelman
>Priority: Major
>  Labels: spark, spark-streaming
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When using the SparkRunner in streaming mode. There is a call to groupByKey 
> followed by a call to updateStateByKey. BEAM-1815 fixed an issue where this 
> used to cause 2 shuffles but it still causes 2 encode/decode cycles.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5519) Spark Streaming Duplicated Encoding/Decoding Effort

2018-09-27 Thread Kyle Winkelman (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631017#comment-16631017
 ] 

Kyle Winkelman commented on BEAM-5519:
--

Proposed:

// SparkGroupAlsoByWindowViaWindowSet.buildPairDStream
JavaRDD>>
JavaRDD>>>
JavaRDD>>
JavaPairRDD

// UpdateStateByKeyOutputIterator.computeNext
gets the scala.collection.Seq the seq of values that have the same key
decoded to scala.collection.Seq> (convert to Iterable)


> Spark Streaming Duplicated Encoding/Decoding Effort
> ---
>
> Key: BEAM-5519
> URL: https://issues.apache.org/jira/browse/BEAM-5519
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Kyle Winkelman
>Assignee: Kyle Winkelman
>Priority: Major
>  Labels: spark, spark-streaming
>
> When using the SparkRunner in streaming mode. There is a call to groupByKey 
> followed by a call to updateStateByKey. BEAM-1815 fixed an issue where this 
> used to cause 2 shuffles but it still causes 2 encode/decode cycles.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5519) Spark Streaming Duplicated Encoding/Decoding Effort

2018-09-27 Thread Kyle Winkelman (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631012#comment-16631012
 ] 

Kyle Winkelman commented on BEAM-5519:
--

Current:
// GroupCombineFunctions.groupByKeyOnly
JavaRDD>>
JavaRDD>>>
JavaRDD>>
JavaPairRDD>
JavaPairRDD
JavaPairRdd> // groupByKey
JavaPairRDD>>
JavaRDD>>>
JavaRDD

// SparkGroupAlsoByWindowViaWindowSet.buildPairDStream
JavaRDD>>>
JavaPairRDD>>
JavaPairRDD>>>
JavaPairRDD

// UpdateStateByKeyOutputIterator.computeNext
gets the scala.collection.Seq the seq of values that have the same key
decoded to scala.collection.Seq>>> (zero or 
one items because we have already grouped by key)
get the head of the Seq and pull out the Iterable>



> Spark Streaming Duplicated Encoding/Decoding Effort
> ---
>
> Key: BEAM-5519
> URL: https://issues.apache.org/jira/browse/BEAM-5519
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Kyle Winkelman
>Assignee: Kyle Winkelman
>Priority: Major
>  Labels: spark, spark-streaming
>
> When using the SparkRunner in streaming mode. There is a call to groupByKey 
> followed by a call to updateStateByKey. BEAM-1815 fixed an issue where this 
> used to cause 2 shuffles but it still causes 2 encode/decode cycles.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)