Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
> how does the user then tell spark that the result stage becomes repeatable because they did the checkpoint?
There are 2 concepts here:
1. The random level of the RDD computing function (see my PR description). There are 3 random levels: IDEMPOTENT, RANDOM_ORDER, and COMPLETE_RANDOM. e.g. file reading is IDEMPOTENT, shuffle fetching is RANDOM_ORDER, and shuffle fetching + repartition/zip is COMPLETE_RANDOM. Spark only needs to retry the succeeding stages if we retry a stage that is COMPLETE_RANDOM (see the sketch after this list for concrete examples of the 3 levels).
2. Whether the result stage is repeatable. e.g. `collect` is repeatable, writing with a Hadoop output committer is not.
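To make the levels concrete, here is a small Scala sketch using real RDD operations; the level annotations in the comments follow the classification above (the paths are placeholders, and the level names come from the PR description, not a public API):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder setup for a runnable sketch.
val sc = new SparkContext(new SparkConf().setAppName("random-levels").setMaster("local[*]"))

// IDEMPOTENT: re-reading the same files yields the same records in the same order.
val lines = sc.textFile("hdfs://path/to/input")

// RANDOM_ORDER: a shuffle fetch returns the same set of records per reduce
// partition, but their order within the partition can differ between attempts.
val counts = lines.map(line => (line, 1)).reduceByKey(_ + _)

// COMPLETE_RANDOM: repartition does a round-robin split that depends on the
// (random) input order, so on a retry a record can land in a different partition.
val rebalanced = counts.repartition(10)
```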
For concept 1, it's a property of the RDD, so users can specify it by implementing a custom RDD, or by marking the RDD map function as order-sensitive (e.g. `zip`). This PR does not design proper public APIs for it yet; see the hypothetical sketch below.
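Since no public API exists yet, the following is only a hypothetical sketch of what declaring the random level of a custom RDD might look like; `RandomLevel` and `randomLevel` are illustrative names, not actual Spark API:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical marker for concept 1. The names mirror the PR description;
// this is NOT a public Spark API.
object RandomLevel extends Enumeration {
  val IDEMPOTENT, RANDOM_ORDER, COMPLETE_RANDOM = Value
}

// A custom RDD whose compute function returns the same output on every
// attempt, so it could declare itself IDEMPOTENT to the scheduler.
class ConstantRDD(sc: SparkContext, numParts: Int) extends RDD[Int](sc, Nil) {

  // Hypothetical hook: report the random level of the computing function.
  def randomLevel: RandomLevel.Value = RandomLevel.IDEMPOTENT

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator.single(split.index) // deterministic: same output on every retry

  override protected def getPartitions: Array[Partition] =
    (0 until numParts).map { i =>
      new Partition { override def index: Int = i }
    }.toArray
}
```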
For concept 2, it's a property of the RDD action. Users usually don't need to specify it, as we will specify it for each RDD action, e.g. `collect` is repeatable and `saveAsHadoopDataset` is not.
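Putting the two concepts together, the failure rule described below could be sketched like this (hypothetical logic reusing the `RandomLevel` names above; not Spark's actual scheduler code):

```scala
// Hypothetical sketch of the rule: fail only when a COMPLETE_RANDOM stage
// must be retried and the action cannot be re-run.
def mustFailJob(level: RandomLevel.Value, actionIsRepeatable: Boolean): Boolean =
  level == RandomLevel.COMPLETE_RANDOM && !actionIsRepeatable

mustFailJob(RandomLevel.COMPLETE_RANDOM, actionIsRepeatable = true)  // collect: false
mustFailJob(RandomLevel.COMPLETE_RANDOM, actionIsRepeatable = false) // saveAsHadoopDataset: true
```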
Spark only fails the job if the RDD is COMPLETE_RANDOM (shuffle + repartition/zip) and the action is not repeatable. If users checkpoint the RDD before repartition/zip (e.g. shuffle + checkpoint + repartition/zip), then the RDD becomes IDEMPOTENT and Spark will not fail the job even if the action is not repeatable.
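The checkpoint workaround only needs the existing RDD checkpoint API; a minimal sketch (paths are placeholders):

```scala
sc.setCheckpointDir("hdfs://path/to/checkpoints") // placeholder path

val shuffled = sc.textFile("hdfs://path/to/input")
  .map(line => (line, 1))
  .reduceByKey(_ + _) // shuffle fetch: RANDOM_ORDER

shuffled.checkpoint() // write this RDD to reliable storage
shuffled.count()      // checkpointing happens on the first action after checkpoint()

// A retry now re-reads the stable checkpoint files instead of re-fetching the
// shuffle, so the input to repartition is IDEMPOTENT and the job is safe even
// with a non-repeatable action.
shuffled.repartition(10).saveAsTextFile("hdfs://path/to/output")
```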