Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21698
> ... assume that computation is idempotent - we do not support non-determinism in computation
Ah, this is a reasonable restriction; we should document it in the RDD
classdoc. What about the source (root RDD or shuffle)? The output of a reduce
task is non-deterministic, because Spark fetches multiple shuffle blocks
concurrently and the order in which blocks finish fetching is random. The
external sorter has the same problem: its output order can change if spilling
happens.
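To make the order sensitivity concrete, here is a minimal sketch in plain Scala (not Spark internals; `roundRobin` is a hypothetical helper): if the same records arrive in a different order on re-computation, a purely position-based assignment puts them into different partitions.

```scala
// Hypothetical helper, not Spark code: assign records to partitions
// round-robin, i.e. purely by their arrival position.
def roundRobin[T](records: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
  records.zipWithIndex
    .groupBy { case (_, i) => i % numPartitions }
    .map { case (p, rs) => p -> rs.map(_._1) }

val run1 = roundRobin(Seq("a", "b", "c", "d"), 2)
// Map(0 -> Seq(a, c), 1 -> Seq(b, d))

// Same data, different arrival order (e.g. shuffle blocks finished
// fetching in a different order on retry): records land in different
// partitions, so a retried task produces a different output.
val run2 = roundRobin(Seq("b", "a", "d", "c"), 2)
// Map(0 -> Seq(b, d), 1 -> Seq(a, c))
```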
Generally I think there are 3 directions:
1. Assume computing functions are idempotent, and also make Spark internal
operations idempotent (reducer, external sorter, maybe more). I think this is
hard to do, but it would be the clearest semantic.
2. Assume computing functions are idempotent and insensitive to the input
data order. Then Spark internal operations can have different output orders.
An example is adding a sort before round-robin, which makes the computing
function insensitive to the input data order (see the sketch after this
list). But I don't think it's reasonable to apply this restriction to all
computing functions.
3. Assume computing functions are non-deterministic. This is not friendly to
the scheduler, as it needs to be able to revert a finished task. We need to
think about whether it's possible to revert a result task.
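For direction 2, here is a minimal sketch of the "sort before round-robin" idea, again in plain Scala rather than Spark's actual `repartition` implementation (`deterministicRoundRobin` is a hypothetical helper): imposing a total order first makes the partition assignment independent of arrival order, so a re-run produces identical partitions.

```scala
// Hypothetical helper, not Spark code: sort into a canonical total order
// before the round-robin assignment, so the result no longer depends on
// the order in which records arrived.
def deterministicRoundRobin[T: Ordering](
    records: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
  records.sorted  // canonical order: every re-run sees the same sequence
    .zipWithIndex
    .groupBy { case (_, i) => i % numPartitions }
    .map { case (p, rs) => p -> rs.map(_._1) }

// Different arrival orders now yield identical partitions:
deterministicRoundRobin(Seq("a", "b", "c", "d"), 2) ==
  deterministicRoundRobin(Seq("b", "a", "d", "c"), 2) // true
```

The cost, as noted above, is that the sort is only safe when the record type has a usable total ordering, which is why this restriction cannot be applied to all computing functions.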