Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/21698
So I think the assumption is that task results are idempotent but not
ordered. Sorry if that's contradictory. The data itself has to be the same on
a rerun, but the order of the elements in it doesn't. That was my general
assumption. I think zip doesn't follow that, though, when the inputs aren't
ordered. Not sure if there are other operations Spark supports that behave
this way; we'd need to go through the list I guess, unless someone already has?
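To make the zip concern concrete, here is a minimal sketch in plain Python. It does not use Spark; `zip_partitions` and the sample data are hypothetical stand-ins for one partition of each input, where a rerun of an upstream task yields the same elements in a different order.

```python
# Plain-Python simulation of the problem; no Spark involved.
# All names here are hypothetical and only illustrate the idea.

def zip_partitions(left, right):
    """Pair elements by position, the way a zip-style operation does."""
    return list(zip(left, right))

# One partition of the left input, assumed stable across attempts.
keys = ["a", "b", "c"]

# The matching partition of the right input as produced by two attempts
# of the same upstream task: same elements, different arrival order.
first_attempt = [1, 2, 3]
rerun_attempt = [3, 1, 2]

first_result = zip_partitions(keys, first_attempt)
rerun_result = zip_partitions(keys, rerun_attempt)

# The input data is the same if order is ignored...
assert sorted(first_attempt) == sorted(rerun_attempt)
# ...but the positional pairing, and so the zipped output, is not.
assert first_result != rerun_result
```

A key-based operation such as a join would give the same result for both attempts, because it never looks at element position.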
I think we just need to document these operations and say the results can
be inconsistent if the inputs are not sorted, or perhaps give users an option
to also sort. Either that or we have to say we don't support unordered output
at all in Spark. Thoughts on just documenting zip and any other operations
affected by unordered input?
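As a rough illustration of the "option to also sort" idea, the sketch below sorts both sides before pairing them. It is plain Python rather than Spark, and `sorted_zip` is a hypothetical helper, not an existing API. It assumes the elements are comparable and that sorting cost is acceptable.

```python
# Plain-Python sketch of sorting before a positional zip.
# `sorted_zip` is a hypothetical helper, not part of any Spark API.

def sorted_zip(left, right):
    """Sort both inputs, then pair elements by position."""
    return list(zip(sorted(left), sorted(right)))

# Two attempts that deliver the same elements in different orders.
attempt_one = sorted_zip(["b", "a", "c"], [2, 3, 1])
attempt_two = sorted_zip(["c", "b", "a"], [1, 2, 3])

# With the sort in place, the pairing no longer depends on arrival order.
assert attempt_one == attempt_two
```

Sorting only makes the result repeatable across reruns. Whether the sorted pairing is the one the user actually wants is a separate question, which is part of why documenting the behavior may still be needed.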
I don't think MapReduce and Pig have this issue because they don't
internally support an operation like zip; everything is based on key/values,
with joins and group-bys on the keys. User code there could produce the same
problem, but I would claim it's the user's fault there.