Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112

Catching up on discussion ...

@cloud-fan

> shuffled RDD will never be deterministic unless the shuffle key is the entire record and key ordering is specified.

Let me rephrase that - key ordering with aggregator specified. Unfortunately this then means it is applicable only to custom user code, since the default Spark APIs do not set both.

> The reduce task fetches multiple remote shuffle blocks at the same time, so the order is always random.

This is not a characteristic of shuffle in MR-based systems, but an implementation detail of shuffle in Spark. In Hadoop MapReduce, for example, shuffle output is always ordered and this problem does not occur.

> In addition, Spark SQL never specifies key ordering.

Spark SQL has re-implemented a lot of the Spark core primitives. Given this, I would expect Spark SQL to:
* introduce a local sort where appropriate when an RDD view is generated off a DataFrame - as has already been done in SPARK-23207 for the repartition case, and/or
* appropriately expose IDEMPOTENT, UNORDERED and INDETERMINATE in the RDD view.

@tgravescs

> I don't agree that "We actually cannot support random output". Users can do this now in MR and spark and we can't really stop them other then say we don't support and if you do failure handling will cause different results.

What I mentioned was not specific to Spark, but general to any MR-like system. This applies even in Hadoop MapReduce, and it used to be a bug in some of our Pig UDFs :-) For example, if a mapper generates random output and there are node failures during the reduce phase (after all mappers have completed), the exact same problem occurs. We cannot, of course, stop users from doing it - but we do not guarantee correct results (just as Hadoop MapReduce does not in this scenario).

> I don't want us to document it away now and then change our mind in next release.
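To make the failure mode concrete, here is a minimal pure-Python sketch (a simulation, not Spark code) of a round-robin `repartition()` reducer. It assumes the reducer assigns records to output partitions by their position in fetch order, which is the behavior SPARK-23207 addressed; the function names are made up for illustration.

```python
# Simulated map outputs: each inner list is one mapper's shuffle block.
map_blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def reduce_side_repartition(blocks, num_partitions, fetch_order):
    """Consume shuffle blocks in the given arrival order, then assign
    records to output partitions round-robin by position. The output
    depends on fetch_order, which in a real cluster is random."""
    fetched = [rec for i in fetch_order for rec in blocks[i]]
    parts = [[] for _ in range(num_partitions)]
    for pos, rec in enumerate(fetched):
        parts[pos % num_partitions].append(rec)
    return parts

first = reduce_side_repartition(map_blocks, 2, [0, 1, 2])  # original run
rerun = reduce_side_repartition(map_blocks, 2, [2, 0, 1])  # retry, new order

# The multiset of all records is identical across runs...
assert sorted(sum(first, [])) == sorted(sum(rerun, []))
# ...but individual partitions differ. So if only one lost partition is
# recomputed after a fetch failure, records are dropped or duplicated.
assert first[0] != rerun[0]

def deterministic_repartition(blocks, num_partitions, fetch_order):
    """SPARK-23207-style fix: locally sort the fetched records (by the
    whole record) before partitioning, making the assignment independent
    of block arrival order."""
    fetched = sorted(rec for i in fetch_order for rec in blocks[i])
    parts = [[] for _ in range(num_partitions)]
    for pos, rec in enumerate(fetched):
        parts[pos % num_partitions].append(rec)
    return parts

assert deterministic_repartition(map_blocks, 2, [0, 1, 2]) == \
       deterministic_repartition(map_blocks, 2, [2, 0, 1])
```

The same positional sensitivity is why recomputing a single partition of an INDETERMINATE stage is unsafe: the retried partition is computed against a different record-to-position mapping than the surviving ones.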
Our end decision should be final. My current thought is as follows:

Without making shuffle output order repeatable, we do not have a way to properly fix this. My understanding from @jiangxb1987, who has looked at it in detail with @sameeragarwal and others, is that this is a very difficult invariant to achieve in the current Spark codebase for shuffle in general. (Please holler if I am off base @jiangxb1987!)

With the assumption that we cannot currently fix this, explicitly warning the user and/or rescheduling all tasks/stages for correctness might be a good stopgap. Users could mitigate the performance impact via checkpointing [1] - I would expect this to be the go-to solution; for any non-trivial job, the perf characteristics and SLA violations are going to be terrible after this patch is applied when failures occur, but we should not have any data loss. In future, we might resolve this issue in a more principled manner.

[1] As @cloud-fan pointed out [here](https://github.com/apache/spark/pull/22112#issuecomment-414034703), sort is not guaranteed to work unless keys are unique, since ordering is defined only on the key and not the value (and so value re-ordering can occur).

@cloud-fan

> This is the problem we are resolving here. This assumption is incorrect, and the RDD closure should handle it, or use what I proposed in this PR: the retry strategy.

I would disagree with this - it is an artifact of an implementation detail of Spark shuffle, and is not the expected behavior for an MR-based system. Unfortunately, this has been the behavior since the beginning (at least since 0.6); IMO it was not a conscious design choice, but rather an oversight.

> IIRC @mridulm didn't agree with it. One problem is that, it's hard for users to realize that Spark returns wrong result, so they don't know when to handle it.
Actually I would expect users to end up doing either of these two - the perf characteristics and lack of predictability in SLA after this patch are going to force users to choose one of the two.