JoshRosen edited a comment on pull request #34158:
URL: https://github.com/apache/spark/pull/34158#issuecomment-934039150


   Stepping back a bit, I have a more fundamental conceptual question about the push-based shuffle configurations: could the decision of whether to use push-based shuffle be made on a per-ShuffleDependency basis rather than a per-app basis?
   
   The `spark.serializer` configuration controls the default serializer, but 
under the hood `ShuffleDependency` instances are sometimes automatically 
configured with non-default serializers:
   
   - Pure SQL / DataFrame / Dataset code [uses 
`UnsafeRowSerializer`](https://github.com/apache/spark/blob/41a16ebf1196bec86aec104e72fd7fb1597c0073/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L132-L133).
   - The RDD shuffle path [will automatically use 
Kryo](https://github.com/apache/spark/blame/41a16ebf1196bec86aec104e72fd7fb1597c0073/core/src/main/scala/org/apache/spark/serializer/SerializerManager.scala#L99-L102)
 when shuffling RDDs containing only primitive types.
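
   The serializer is carried as a field on each `ShuffleDependency` (defaulting to the instance built from the app-wide `spark.serializer`), which is how the cases above are wired up. A minimal sketch, assuming an existing `rdd` and using Spark's `@DeveloperApi` `ShuffleDependency` constructor:

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.serializer.KryoSerializer

// The serializer is a per-dependency field: it defaults to
// SparkEnv.get.serializer (i.e. spark.serializer) but can be
// overridden for an individual shuffle.
def kryoShuffleDep(rdd: RDD[(Int, Int)]): ShuffleDependency[Int, Int, Int] =
  new ShuffleDependency[Int, Int, Int](
    rdd,
    new HashPartitioner(200),
    serializer = new KryoSerializer(new SparkConf()))
```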
   
   Spark's shuffle code contains a [serialized sorting 
mode](https://github.com/apache/spark/blame/fa1805db48ca53ece4cbbe42ebb2a9811a142ed2/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala#L37-L71)
 which can only be used when `serializer.supportsRelocationOfSerializedObjects == true`. The decision of whether to use serialized sorting mode is made on a per-ShuffleDependency (not per-app) basis (by [using a different ShuffleHandle](https://github.com/apache/spark/blame/fa1805db48ca53ece4cbbe42ebb2a9811a142ed2/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala#L106-L109)).
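
   For reference, the linked decision logic looks roughly like this (a condensed paraphrase, not verbatim; the real method uses `private[spark]` handle classes and includes some casts omitted here):

```scala
// Condensed paraphrase of SortShuffleManager.registerShuffle: the type of
// the returned handle records, per ShuffleDependency, which write path
// the shuffle will use.
override def registerShuffle[K, V, C](
    shuffleId: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
    new BypassMergeSortShuffleHandle(shuffleId, dependency)
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
    // Only taken when dependency.serializer.supportsRelocationOfSerializedObjects
    // holds (among other checks inside canUseSerializedShuffle).
    new SerializedShuffleHandle(shuffleId, dependency)
  } else {
    new BaseShuffleHandle(shuffleId, dependency)
  }
}
```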
   
   If we did something similar for push-based shuffle (PBS), then users could enable it and still get some benefit even if they couldn't, or didn't want to, use Kryo as the default serializer. With that approach we wouldn't need to construct a default serializer instance: at `SparkEnv` creation time or executor startup time we'd only need to know whether the other prerequisites of PBS were met (configuration enabled, shuffle service enabled, IO encryption disabled, etc.). We'd still need to check the serializer properties, but that check would happen on the driver and its result would somehow be encoded in the `ShuffleHandle`.
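
   As a purely hypothetical sketch of that idea (neither `canUsePushBasedShuffle` nor `appLevelPrereqsMet` exists in Spark; the relocation property is assumed to be the relevant serializer check, by analogy with serialized sorting mode):

```scala
// Hypothetical sketch only; the names here are invented for illustration.
// appLevelPrereqsMet stands for the app-level checks done once at SparkEnv
// creation / executor startup (PBS config enabled, shuffle service enabled,
// IO encryption disabled, etc.); the serializer check runs on the driver,
// per ShuffleDependency, and registerShuffle would encode the outcome in
// the returned ShuffleHandle.
private def canUsePushBasedShuffle[K, V, C](
    appLevelPrereqsMet: Boolean,
    dep: ShuffleDependency[K, V, C]): Boolean = {
  appLevelPrereqsMet &&
    dep.serializer.supportsRelocationOfSerializedObjects
}
```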
   
   Is that possible given the architecture of push-based shuffle?
   
   **Edit**: to clarify, I'm posing this as a longer-term question (not for 
Spark 3.2).

