WangGuangxin commented on pull request #35460: URL: https://github.com/apache/spark/pull/35460#issuecomment-1041113248
> Right, shouldn't we reject it? Distributing by "hash(ID)" or similar makes more sense, not least because it is reproducible and deterministic across runs and environments.

Rejecting it may not be a good idea:

1. Both Hive and Presto support patterns like `distribute by rand`, as well as join/group-by on `rand`. Spark also seems to intend to support group-by on `rand` (see https://github.com/apache/spark/pull/16404). In addition, some UDFs and java_methods are marked as indeterminate; if we reject this, users cannot join or group by a column generated by those UDFs or java_methods.
2. The root cause of data inconsistency when shuffling by a `rand` expression is that Spark retries only some of the map tasks when a shuffle fetch fails. If we retry the whole stage instead, there is no problem, and we can reuse the existing logic in DAGScheduler to achieve this.
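To make the determinism argument in point 2 concrete, here is a minimal plain-Python sketch (not Spark code; all names are hypothetical) showing why a partial retry of a map task is safe under hash partitioning but can duplicate or drop rows under rand-based partitioning, assuming the retried task does not replay the same random sequence:

```python
import random

NUM_PARTITIONS = 4

def hash_partition(record_id: int) -> int:
    # Deterministic: the same record always lands in the same partition,
    # so rerunning a map task reproduces its original output exactly.
    return hash(record_id) % NUM_PARTITIONS

def rand_partition(record_id: int, rng: random.Random) -> int:
    # Non-deterministic: a rerun of the map task draws fresh random
    # numbers, so records may land in different partitions.
    return rng.randrange(NUM_PARTITIONS)

records = list(range(100))

# First run of a map task, then a simulated partial retry after a
# shuffle fetch failure.
first = {r: hash_partition(r) for r in records}
retry = {r: hash_partition(r) for r in records}
assert first == retry  # hash partitioning is stable across retries

# With rand-based partitioning, the retry uses a different seed (as a
# re-executed task would), so records move between partitions: reducers
# that already fetched the old output see duplicated or missing rows.
first_rand = {r: rand_partition(r, random.Random(1)) for r in records}
retry_rand = {r: rand_partition(r, random.Random(2)) for r in records}
moved = sum(1 for r in records if first_rand[r] != retry_rand[r])
print(f"{moved} of {len(records)} records changed partition on retry")
```

Retrying the whole stage, as suggested above, sidesteps this: every reducer then consumes output from a single consistent run of the random assignment.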
