Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/7021#issuecomment-119030716
@viirya Thanks for tackling this issue, but I believe the existing
implementation is not fully correct.
There are two high level problems: First, if the checkpointing iterator is
not fully consumed by the user, then we end up checkpointing only a subset of
the computed data. I think we should ensure that the iterator is fully drained
before we can safely truncate the RDD's lineage through `rdd.markCheckpointed`.
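To make the "fully drained" requirement concrete, here is a minimal sketch of a wrapper iterator that only fires its completion callback once the caller has exhausted it. The names (`CheckpointingIterator`, `onComplete`) are illustrative assumptions, not Spark's actual internals:

```scala
// Hypothetical sketch: an iterator that records the elements it hands out and
// invokes onComplete only once the caller has drained it. Only at that point
// would it be safe to truncate lineage via something like rdd.markCheckpointed.
class CheckpointingIterator[T](underlying: Iterator[T], onComplete: Seq[T] => Unit)
    extends Iterator[T] {
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[T]
  private var completed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !completed) {
      completed = true
      onComplete(buffer.toSeq) // the full dataset was materialized; safe to checkpoint
    }
    more
  }

  override def next(): T = {
    val elem = underlying.next()
    buffer += elem // remember every element we hand out
    elem
  }
}
```

If the caller stops early (e.g. `take(2)`), `onComplete` never runs, which is exactly the partial-checkpoint hazard described above.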
Second, the state transition from `Initialized` ->
`CheckpointingInProgress` -> `Checkpointed` is not respected. In the new model,
we should transition into `CheckpointingInProgress` as soon as the iterator is
returned (so multiple calls to it will not lead to the RDD being checkpointed
many times). Then only after we fully iterate through the iterator can we
declare the RDD as `Checkpointed`.
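The transition rules above can be sketched as a small state machine. The enum values mirror the names in this comment; the trigger methods and `CheckpointTracker` class are assumptions for illustration, not Spark's code:

```scala
// Hypothetical sketch of the Initialized -> CheckpointingInProgress ->
// Checkpointed transitions described above.
object CheckpointState extends Enumeration {
  val Initialized, CheckpointingInProgress, Checkpointed = Value
}

class CheckpointTracker {
  import CheckpointState._
  private var state = Initialized

  // Called as soon as the checkpointing iterator is handed to the caller.
  // Returns true only on the first call, so repeated calls cannot cause
  // the RDD to be checkpointed many times.
  def onIteratorReturned(): Boolean = synchronized {
    if (state == Initialized) { state = CheckpointingInProgress; true }
    else false
  }

  // Called only after the iterator has been fully consumed.
  def onFullyConsumed(): Unit = synchronized {
    require(state == CheckpointingInProgress, s"unexpected state $state")
    state = Checkpointed
  }

  def current: CheckpointState.Value = synchronized(state)
}
```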
I don't actually have a great idea for how to fix the first issue, however.
We have no visibility into how the higher-level caller will use
the iterator, and if we consume it eagerly ourselves then the application might
fail. @tdas this seems like a fundamentally difficult problem.