Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@cloud-fan I think we have to be clear on the boundaries of the solution we
can provide in Spark.
> RDD#mapPartitions and its friends can take arbitrary user functions,
> which may produce random result
As stated above, this is something we do not support in Spark.
Spark, MR, etc. assume that computation is idempotent - we do not support
non-determinism in computation, though computation may be sensitive to input
order. For a given input partition (an iterator of tuples), the closure must
always generate the same output partition (iterator of tuples).
This is why 'randomness' (or rather pseudo-randomness) is seeded using
invariants like the partition id, which results in the same output partition
on task re-execution.
The problem we have here is: even if user code satisfies this constraint,
non-determinism in input order changes the output whenever the closure is
order sensitive.
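A hypothetical example of such a closure: one that is perfectly deterministic per element, yet order sensitive, because it tags each row with its position in the iterator. Same multiset of rows, different arrival order, different output:

```python
def index_partition(rows):
    # Deterministic closure, but order sensitive: each row is tagged
    # with its position in the input iterator.
    return list(enumerate(rows))

assert index_partition(["a", "b"]) == [(0, "a"), (1, "b")]
# Same rows delivered in a different order produce a different partition:
assert index_partition(["b", "a"]) != index_partition(["a", "b"])
```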
Given this, analyzing the statement below:
> Step 3 is problematic: assuming we have 5 map tasks and 5 reduce tasks,
> and the input data is random. Let's say reduce task 1,2,3,4 are finished,
> reduce task 5 failed with FetchFailed, and Spark rerun map task 3 and 4. Map
> task 3 and 4 reprocess the random data and create different shuffle blocks for
> reduce task 3, 4, 5. So not only reduce task 5 needs to rerun, reduce task 3,
> 4, 5 all need to rerun, because their input data changed.
Here, map tasks 3 and 4 will always produce the same output partitions for
supported closures - provided the input partitions remain the same.
When reading from checkpoints, immutable HDFS files, etc., this invariant
is satisfied.
With @squito's suggestion implemented, this invariant will be satisfied for
shuffle input as well.
With deterministic input partitions, we can see that the output of map tasks
3 and 4 will always be the same, and the reduce inputs for tasks 3/4/5 will be
the same. So only reduce task 5 needs to be rerun, and none of the other
inputs will change.
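The rerun scenario above can be sketched in a few lines of Python (a toy model, not Spark internals; the partitioner and task shape are assumptions). A deterministic map task hash-partitions its input into one shuffle block per reducer, so re-executing it after a FetchFailed regenerates identical blocks and no finished reducer's input changes:

```python
def map_task(partition, num_reducers):
    # Deterministic map task: stably hash-partition each row into a
    # shuffle block per reducer. Same input partition, same blocks.
    blocks = {r: [] for r in range(num_reducers)}
    for row in partition:
        blocks[sum(map(ord, row)) % num_reducers].append(row)
    return blocks

partition = ["x", "y", "z", "w"]
original = map_task(partition, 5)
rerun = map_task(partition, 5)  # re-execution after a FetchFailed
# Every reducer's shuffle block is unchanged, so the already-finished
# reducers 1-4 need not rerun; only the failed reducer 5 does.
assert original == rerun
```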