[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

mridulm Tue, 14 Aug 2018 11:17:05 -0700

Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/21698
  
    @cloud-fan I think we have to be clear on the boundaries of the solution we 
can provide in spark.
    
    > RDD#mapPartitions and its friends can take arbitrary user functions, 
which may produce random result
    
    As stated above, this is something we do not support in spark.
    Spark, MR, etc assume that computation is idempotent - we do not support 
non determinism in computation : but computation could be sensitive to input 
order. For a given input partition (iterator of tuples), the closure must 
always generate the same output partition (iterator of tuples).
    
    Which is why 'randomness' (or rather pseudo-randomness) is seeded based 
using invariants like partition id which result in same output partition on 
task re-execution.
    
    The problem we have here is : even if user code satisfies this constraint, 
due to non determinism in input order, the output changes when closure is order 
sensitive.
    
    Given this, analyzing the statement below :
    
    > Step 3 is problematic: assuming we have 5 map tasks and 5 reduce tasks, 
and the input data is random. Let's say reduce task 1,2,3,4 are finished, 
reduce task 5 failed with FetchFailed, and Spark rerun map task 3 and 4. Map 
task 3 and 4 reprocess the random data and create diffrent shuffle blocks for 
reduce task 3, 4, 5. So not only reduce task 5 needs to rerun, reduce task 3, 
4, 5 all need to rerun, because their input data changed.
    
    Here - map task 3 and 4 will always produce the same output partition for 
supported closures - if the input partition remains same.
    When reading off checkpoint's, immutable hdfs files, etc - this invariant 
is satisfied.
    With @squito's suggestion implemented, this invariant will be satisfied for 
shuffle input as well.
    
    With deterministic input partition - we can see that output of map task 3 
and 4 will always be the same - and reduce task input's for 3/4/5 will be the 
same. So only reduce task 5 will need to be rerun and none of the other input's 
will change.




---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

Reply via email to