Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
> how does the user then tell spark that the result stage becomes repeatable because they did the checkpoint?
There are 2 concepts here:
1. The random level of the RDD computing function (see my PR description). There are 3 random levels: IDEMPOTENT, RANDOM_ORDER, and COMPLETE_RANDOM. e.g. file reading is IDEMPOTENT, shuffle fetching is RANDOM_ORDER, and shuffle fetching + repartition/zip is COMPLETE_RANDOM. Spark only needs to retry the succeeding stages if we retry a stage that is COMPLETE_RANDOM (see the sketch after this list for concrete examples of the 3 levels).
2. Whether the result stage is repeatable. e.g. `collect` is repeatable, writing with a Hadoop output committer is not.
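To make the levels concrete, here is a small Scala sketch using real RDD operations; the level annotations in the comments follow the classification above (the paths are placeholders, and the level names come from the PR description, not a public API):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder setup for a runnable sketch.
val sc = new SparkContext(new SparkConf().setAppName("random-levels").setMaster("local[*]"))

// IDEMPOTENT: re-reading the same files yields the same records in the same order.
val lines = sc.textFile("hdfs://path/to/input")

// RANDOM_ORDER: a shuffle fetch returns the same set of records per reduce
// partition, but their order within the partition can differ between attempts.
val counts = lines.map(line => (line, 1)).reduceByKey(_ + _)

// COMPLETE_RANDOM: repartition does a round-robin split that depends on the
// (random) input order, so on a retry a record can land in a different partition.
val rebalanced = counts.repartition(10)
```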
For concept 1, it's a property of the RDD, so users can specify it by implementing a custom RDD, or by marking the RDD map function as order-sensitive (e.g. `zip`). This PR does not design proper public APIs for it yet; see the hypothetical sketch below.
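Since no public API exists yet, the following is only a hypothetical sketch of what declaring the random level of a custom RDD might look like; `RandomLevel` and `randomLevel` are illustrative names, not actual Spark API:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical marker for concept 1. The names mirror the PR description;
// this is NOT a public Spark API.
object RandomLevel extends Enumeration {
  val IDEMPOTENT, RANDOM_ORDER, COMPLETE_RANDOM = Value
}

// A custom RDD whose compute function returns the same output on every
// attempt, so it could declare itself IDEMPOTENT to the scheduler.
class ConstantRDD(sc: SparkContext, numParts: Int) extends RDD[Int](sc, Nil) {

  // Hypothetical hook: report the random level of the computing function.
  def randomLevel: RandomLevel.Value = RandomLevel.IDEMPOTENT

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator.single(split.index) // deterministic: same output on every retry

  override protected def getPartitions: Array[Partition] =
    (0 until numParts).map { i =>
      new Partition { override def index: Int = i }
    }.toArray
}
```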
For concept 2, it's a property of the RDD action. Users usually don't need to specify it, as we will specify it for each RDD action, e.g. `collect` is repeatable and `saveAsHadoopDataset` is not.
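Putting the two concepts together, the failure rule described below could be sketched like this (hypothetical logic reusing the `RandomLevel` names above; not Spark's actual scheduler code):

```scala
// Hypothetical sketch of the rule: fail only when a COMPLETE_RANDOM stage
// must be retried and the action cannot be re-run.
def mustFailJob(level: RandomLevel.Value, actionIsRepeatable: Boolean): Boolean =
  level == RandomLevel.COMPLETE_RANDOM && !actionIsRepeatable

mustFailJob(RandomLevel.COMPLETE_RANDOM, actionIsRepeatable = true)  // collect: false
mustFailJob(RandomLevel.COMPLETE_RANDOM, actionIsRepeatable = false) // saveAsHadoopDataset: true
```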
Spark only fails the job if the RDD is COMPLETE_RANDOM (shuffle + repartition/zip) and the action is not repeatable. If users checkpoint the RDD before repartition/zip (e.g. shuffle + checkpoint + repartition/zip), then the RDD becomes IDEMPOTENT and Spark will not fail the job even if the action is not repeatable.
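The checkpoint workaround only needs the existing RDD checkpoint API; a minimal sketch (paths are placeholders):

```scala
sc.setCheckpointDir("hdfs://path/to/checkpoints") // placeholder path

val shuffled = sc.textFile("hdfs://path/to/input")
  .map(line => (line, 1))
  .reduceByKey(_ + _) // shuffle fetch: RANDOM_ORDER

shuffled.checkpoint() // write this RDD to reliable storage
shuffled.count()      // checkpointing happens on the first action after checkpoint()

// A retry now re-reads the stable checkpoint files instead of re-fetching the
// shuffle, so the input to repartition is IDEMPOTENT and the job is safe even
// with a non-repeatable action.
shuffled.repartition(10).saveAsTextFile("hdfs://path/to/output")
```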