Github user harishreedharan commented on the pull request:

    https://github.com/apache/spark/pull/4964#issuecomment-81914505
  
    So in that version of the PR, I remove the metadata in the checkpoint only 
after the batch has been completed. What it does is:
    - when a batch is completed, update `lastCompletedBatch`
    - when a checkpoint is completed, clear the metadata if its time matches 
`lastCompletedBatch`, otherwise don't - so a pre-batch-start checkpoint never 
cleans the metadata. (Admittedly, I could use a better way of comparing than 
the `Seq.min`.)
    
    This gets the same result as #5008, but it is not enough when multiple 
jobs can run at once. In that case, we need to keep track of which jobs have 
been started, and clear the metadata only for the oldest batch that has both 
completed and had its post-completion checkpoint finish. Any batch newer than 
that one, even if it has completed and then checkpointed, must not clean any 
metadata - that is what this PR's current state does. (There is some code - 
the iterator stuff - that moves the oldest-processed-batch pointer forward 
once the oldest one has been processed, so any newer batches that have been 
completed and checkpointed can be removed from tracking.)
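    The multi-job bookkeeping described above can be sketched like this. Again, this is a hypothetical illustration of the idea (oldest-first cleanup with a pointer that advances past fully-processed batches), not the PR's actual implementation; all names are made up.

```python
class MultiJobCheckpointTracker:
    """Sketch: with concurrent jobs, clear metadata only up to the oldest
    batch that has completed and been checkpointed."""

    def __init__(self):
        self.started = []          # batch times of started jobs, oldest first
        self.checkpointed = set()  # batches completed AND post-completion-checkpointed

    def on_job_started(self, batch_time):
        self.started.append(batch_time)

    def on_batch_checkpointed(self, batch_time, clear_metadata):
        # Mark this batch as completed-and-checkpointed.
        self.checkpointed.add(batch_time)
        # Clear metadata only from the oldest batch forward; a newer batch
        # that finished out of order waits until everything older is done.
        # This mirrors the "move the oldest processed batch forward" logic:
        # once the oldest is cleared, any newer batches that are already
        # done can be cleared too and dropped from tracking.
        while self.started and self.started[0] in self.checkpointed:
            oldest = self.started.pop(0)
            self.checkpointed.discard(oldest)
            clear_metadata(oldest)
```

    For example, if batches 1 and 2 run concurrently and batch 2 checkpoints first, nothing is cleared until batch 1's checkpoint completes, at which point both are cleaned in order.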

