Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21698
> ... assume that computation is idempotent - we do not support non-determinism in computation
Ah, this is a reasonable restriction; we should document it in the RDD
classdoc. What about the source (root RDD or shuffle)? The output of a reduce
task is non-deterministic, because Spark fetches multiple shuffle blocks
concurrently and the order in which blocks finish fetching is random. The
external sorter has the same problem: its output order can change if spilling
happens.
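To make the order sensitivity concrete, here is a minimal sketch in plain Scala (not Spark internals; `roundRobin` is a hypothetical helper): if the same records arrive in a different order on re-computation, a purely position-based assignment puts them into different partitions.

```scala
// Hypothetical helper, not Spark code: assign records to partitions
// round-robin, i.e. purely by their arrival position.
def roundRobin[T](records: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
  records.zipWithIndex
    .groupBy { case (_, i) => i % numPartitions }
    .map { case (p, rs) => p -> rs.map(_._1) }

val run1 = roundRobin(Seq("a", "b", "c", "d"), 2)
// Map(0 -> Seq(a, c), 1 -> Seq(b, d))

// Same data, different arrival order (e.g. shuffle blocks finished
// fetching in a different order on retry): records land in different
// partitions, so a retried task produces a different output.
val run2 = roundRobin(Seq("b", "a", "d", "c"), 2)
// Map(0 -> Seq(b, d), 1 -> Seq(a, c))
```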
Generally I think there are 3 directions:
1. Assume computing functions are idempotent, and also make Spark internal
operations idempotent (reducer, external sorter, maybe more). I think this is
hard to do, but it would be the clearest semantic.
2. Assume computing functions are idempotent and insensitive to the input
data order. Then Spark internal operations can have different output orders.
An example is adding a sort before round-robin, which makes the computing
function insensitive to the input data order (see the sketch after this
list). But I don't think it's reasonable to apply this restriction to all
computing functions.
3. Assume computing functions are non-deterministic. This is not friendly to
the scheduler, as it needs to be able to revert a finished task. We need to
think about whether it's possible to revert a result task.
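For direction 2, here is a minimal sketch of the "sort before round-robin" idea, again in plain Scala rather than Spark's actual `repartition` implementation (`deterministicRoundRobin` is a hypothetical helper): imposing a total order first makes the partition assignment independent of arrival order, so a re-run produces identical partitions.

```scala
// Hypothetical helper, not Spark code: sort into a canonical total order
// before the round-robin assignment, so the result no longer depends on
// the order in which records arrived.
def deterministicRoundRobin[T: Ordering](
    records: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
  records.sorted  // canonical order: every re-run sees the same sequence
    .zipWithIndex
    .groupBy { case (_, i) => i % numPartitions }
    .map { case (p, rs) => p -> rs.map(_._1) }

// Different arrival orders now yield identical partitions:
deterministicRoundRobin(Seq("a", "b", "c", "d"), 2) ==
  deterministicRoundRobin(Seq("b", "a", "d", "c"), 2) // true
```

The cost, as noted above, is that the sort is only safe when the record type has a usable total ordering, which is why this restriction cannot be applied to all computing functions.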