Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/7021#issuecomment-119030716
@viirya Thanks for tackling this issue, but I believe the existing
implementation is not fully correct.
There are two high level problems: First, if the checkpointing iterator is
not fully consumed by the user, then we end up checkpointing only a subset of
the computed data. I think we should ensure that the iterator is fully drained
before we can safely truncate the RDD's lineage through `rdd.markCheckpointed`.
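To make the "fully drained" requirement concrete, here is a minimal sketch of a wrapper iterator that only fires its completion callback once the caller has exhausted it. The names (`CheckpointingIterator`, `onComplete`) are illustrative assumptions, not Spark's actual internals:

```scala
// Hypothetical sketch: an iterator that records the elements it hands out and
// invokes onComplete only once the caller has drained it. Only at that point
// would it be safe to truncate lineage via something like rdd.markCheckpointed.
class CheckpointingIterator[T](underlying: Iterator[T], onComplete: Seq[T] => Unit)
    extends Iterator[T] {
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[T]
  private var completed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !completed) {
      completed = true
      onComplete(buffer.toSeq) // the full dataset was materialized; safe to checkpoint
    }
    more
  }

  override def next(): T = {
    val elem = underlying.next()
    buffer += elem // remember every element we hand out
    elem
  }
}
```

If the caller stops early (e.g. `take(2)`), `onComplete` never runs, which is exactly the partial-checkpoint hazard described above.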
Second, the state transition from `Initialized` ->
`CheckpointingInProgress` -> `Checkpointed` is not respected. In the new model,
we should transition into `CheckpointingInProgress` as soon as the iterator is
returned (so multiple calls to it will not lead to the RDD being checkpointed
many times). Then only after we fully iterate through the iterator can we
declare the RDD as `Checkpointed`.
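The transition rules above can be sketched as a small state machine. The enum values mirror the names in this comment; the trigger methods and `CheckpointTracker` class are assumptions for illustration, not Spark's code:

```scala
// Hypothetical sketch of the Initialized -> CheckpointingInProgress ->
// Checkpointed transitions described above.
object CheckpointState extends Enumeration {
  val Initialized, CheckpointingInProgress, Checkpointed = Value
}

class CheckpointTracker {
  import CheckpointState._
  private var state = Initialized

  // Called as soon as the checkpointing iterator is handed to the caller.
  // Returns true only on the first call, so repeated calls cannot cause
  // the RDD to be checkpointed many times.
  def onIteratorReturned(): Boolean = synchronized {
    if (state == Initialized) { state = CheckpointingInProgress; true }
    else false
  }

  // Called only after the iterator has been fully consumed.
  def onFullyConsumed(): Unit = synchronized {
    require(state == CheckpointingInProgress, s"unexpected state $state")
    state = Checkpointed
  }

  def current: CheckpointState.Value = synchronized(state)
}
```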
I don't actually have a great idea for how to fix the first issue, however.
We have no visibility into how the higher-level caller will use
the iterator, and if we consume it eagerly ourselves then the application might
fail. @tdas this seems like a fundamentally difficult problem.