Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/22112

@mridulm Thanks for pointing out that comment, I hadn't seen it; it's a very nice write-up. I don't agree that "We actually cannot support random output". Users can do this now in both MR and Spark, and we can't really stop them other than to say we don't support it and that, if you do it, failure handling will cause different results.

> Ideally, I would like to see order sensitive closure's fixed - and fixing repartition + shuffle would fix this for a general case for all order sensitive closures.

This is what I'm trying to get at. We need to decide whether we truly want to fix these or just give the user a warning about them. Or perhaps it's a combination, depending on the operation. I agree that whatever fix we make for repartition could most likely be applied to the others as well. I don't want us to document it away now and then change our mind in the next release. Our end decision should be final.

Note that so far I don't see how this temporary workaround can be extended to ResultTasks. The only way I truly see it working is to make the output sorted, which, as you said, is very expensive in Spark, and you would have to force it because you don't know in advance whether a task will fail. It also may only apply to certain operations we control, like zip.

> There should be a lot of usage within mllib (atleast yahoo's internal BigML library did have a lot of it).

Do you know more specifics here? What are the ML libraries doing to cause this - some sort of sampling, or does the algorithm just generate slightly different results depending on shuffle order?
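For readers following along, here is a minimal toy sketch (plain Python, not Spark code) of why a round-robin `repartition` is an order-sensitive closure: if a recomputed upstream partition yields the same records in a different order after a fetch failure, the round-robin assignment changes, so output partitions read before the failure and ones recomputed after it can together duplicate or drop records.

```python
# Toy model of round-robin repartitioning -- an illustration of the
# order sensitivity being discussed, not Spark's actual implementation.

def round_robin_partition(records, num_partitions):
    """Assign records to output partitions in round-robin order."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts

# First attempt: upstream yields records in one order.
first = round_robin_partition(["a", "b", "c", "d"], 2)
print(first)  # [['a', 'c'], ['b', 'd']]

# Retry: the recomputed upstream task yields the same records in a
# different (equally valid) order, so the assignment shifts.
retry = round_robin_partition(["b", "a", "d", "c"], 2)
print(retry)  # [['b', 'd'], ['a', 'c']]

# If partition 0 was consumed before the failure (['a', 'c']) and
# partition 1 is re-fetched from the retry (['a', 'c']), the combined
# output duplicates "a" and "c" and loses "b" and "d" entirely.
```

Sorting the records before partitioning makes the assignment independent of arrival order, which is exactly the expensive fix discussed above.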