Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/22112
@mridulm Thanks for pointing out that comment, I hadn't seen it; it's a very
nice write-up. I don't agree that "We actually cannot support random output".
Users can do this now in MR and Spark, and we can't really stop them other than
to say we don't support it, and that if you do it, failure handling will cause
different results.
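To make the order-sensitivity concrete, here is a minimal sketch in plain Python (not Spark; the `round_robin` helper is hypothetical) of why round-robin repartitioning depends on input order: the same set of records, delivered in a different order on a task retry, lands in different partitions.

```python
def round_robin(records, num_partitions):
    """Deal records into partitions in arrival order,
    mimicking how a round-robin repartition assigns rows."""
    partitions = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        partitions[i % num_partitions].append(rec)
    return partitions

data = [10, 20, 30, 40, 50, 60]
first_run = round_robin(data, 2)

# Simulate a task retry where the upstream shuffle delivers the
# same records in a different order.
retry_run = round_robin([20, 10, 30, 40, 60, 50], 2)

print(first_run)  # [[10, 30, 50], [20, 40, 60]]
print(retry_run)  # [[20, 30, 60], [10, 40, 50]]
```

Same records overall, different partition contents, so any downstream operation that depends on partition membership or position (zipWithIndex, for example) can return different results after a retry.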
> Ideally, I would like to see order sensitive closure's fixed - and fixing
repartition + shuffle would fix this for a general case for all order sensitive
closures.
This is what I'm trying to get at. We need to decide whether we truly want
to fix these cases or just warn the user about them. Or perhaps it's a
combination, depending on the case. I agree that most likely whatever fix we
make for repartition could be applied to the others as well. I don't want us to
document it away now and then change our mind in the next release. Our end
decision should be final.
Note that so far I don't see how this temporary workaround can be extended
to ResultTasks. The only way I see to do this is to sort the output, which,
like you said, is very expensive in Spark, and you would have to force it
because you don't know in advance whether a task will fail. It also may only
apply to certain operations we control, like zip.
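Continuing the plain-Python sketch from above (again, not Spark; `deterministic_round_robin` is a hypothetical helper), this is the sort-based idea in miniature: sorting the records before dealing them out makes partition contents independent of arrival order, which is exactly why it survives retries, and exactly why it costs a full sort.

```python
def deterministic_round_robin(records, num_partitions):
    """Sort first, then deal into partitions, so the result no
    longer depends on the order records happened to arrive in."""
    partitions = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(sorted(records)):
        partitions[i % num_partitions].append(rec)
    return partitions

run1 = deterministic_round_robin([10, 20, 30, 40, 50, 60], 2)
run2 = deterministic_round_robin([20, 10, 30, 40, 60, 50], 2)
print(run1 == run2)  # True: identical partitions either way
```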
> There should be a lot of usage within mllib (atleast yahoo's internal
BigML library did have a lot of it).
Do you know more specifics here? What are the ML libraries doing to cause
this: some sort of sampling, or do the algorithms just generate slightly
different results depending on shuffle order?