WangGuangxin commented on pull request #35460: URL: https://github.com/apache/spark/pull/35460#issuecomment-1041113248
> Right, shouldn't we reject it? Distributing by "hash(ID)" or similar makes more sense, not least because it is reproducible and deterministic across runs and environments.

Rejecting it may not be a good idea:

1. Both Hive and Presto support patterns like `distribute by rand`, as well as join/group-by on `rand`. Spark also seems to intend to support group-by on `rand` (see https://github.com/apache/spark/pull/16404). In addition, some UDFs and java_methods are marked as indeterminate; if we reject this, users cannot join or group by a column generated by those UDFs or java_methods.
2. The root cause of data inconsistency when shuffling by a `rand` expression is that Spark retries only some of the map tasks when a shuffle fetch fails. If we retry the whole stage instead, there is no problem, and we can reuse the existing logic in DAGScheduler to achieve this.
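To make the determinism argument in point 2 concrete, here is a minimal plain-Python sketch (not Spark code; all names are hypothetical) showing why a partial retry of a map task is safe under hash partitioning but can duplicate or drop rows under rand-based partitioning, assuming the retried task does not replay the same random sequence:

```python
import random

NUM_PARTITIONS = 4

def hash_partition(record_id: int) -> int:
    # Deterministic: the same record always lands in the same partition,
    # so rerunning a map task reproduces its original output exactly.
    return hash(record_id) % NUM_PARTITIONS

def rand_partition(record_id: int, rng: random.Random) -> int:
    # Non-deterministic: a rerun of the map task draws fresh random
    # numbers, so records may land in different partitions.
    return rng.randrange(NUM_PARTITIONS)

records = list(range(100))

# First run of a map task, then a simulated partial retry after a
# shuffle fetch failure.
first = {r: hash_partition(r) for r in records}
retry = {r: hash_partition(r) for r in records}
assert first == retry  # hash partitioning is stable across retries

# With rand-based partitioning, the retry uses a different seed (as a
# re-executed task would), so records move between partitions: reducers
# that already fetched the old output see duplicated or missing rows.
first_rand = {r: rand_partition(r, random.Random(1)) for r in records}
retry_rand = {r: rand_partition(r, random.Random(2)) for r in records}
moved = sum(1 for r in records if first_rand[r] != retry_rand[r])
print(f"{moved} of {len(records)} records changed partition on retry")
```

Retrying the whole stage, as suggested above, sidesteps this: every reducer then consumes output from a single consistent run of the random assignment.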
