Github user harishreedharan commented on the pull request:
https://github.com/apache/spark/pull/4964#issuecomment-81914505
So in that version of the PR, I remove the metadata in the checkpoint only
after the batch has been completed. What it does is:
- when a batch is completed, update `lastCompletedBatch`
- when a checkpoint is completed, clear the metadata if the checkpoint's time
is the same as `lastCompletedBatch`, and do nothing otherwise - so the
pre-batch-start checkpoint never cleans the metadata. (Admittedly, I could use
a better way of comparing than the `Seq.min`.)
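A minimal model of that rule, sketched in Python for brevity (the PR itself is Scala); the `Tracker` class and its method names are hypothetical, not from the PR:

```python
class Tracker:
    """Models "clear checkpoint metadata only after the batch completes"."""

    def __init__(self):
        self.last_completed_batch_time = None
        self.cleared = []  # batch times whose metadata has been cleared

    def on_batch_completed(self, batch_time):
        # Remember the most recently completed batch (lastCompletedBatch).
        self.last_completed_batch_time = batch_time

    def on_checkpoint_completed(self, checkpoint_time):
        # Only the post-completion checkpoint (same time as the last
        # completed batch) clears metadata; the pre-batch-start checkpoint
        # for a not-yet-completed batch falls through and clears nothing.
        if checkpoint_time == self.last_completed_batch_time:
            self.cleared.append(checkpoint_time)
```

For example, a pre-start checkpoint at time 2 while only batch 1 has completed clears nothing; once batch 2 completes, its checkpoint at time 2 does.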
This gets the same result as #5008, but it is not enough when multiple jobs
can be running at once. In that case, we need to keep track of which jobs have
been started, and clear metadata only up to the oldest batch that has both
completed and finished its post-completion checkpoint. Any batch newer than
that one, even if it has completed and been checkpointed, should not clean any
metadata - that is what this PR's current state does. (There is some code -
the iterator stuff - to move the oldest-processed-batch pointer forward once
the oldest one has been processed, so that any newer batches which have been
completed and checkpointed can be removed from tracking.)
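The multi-job bookkeeping above can be sketched as follows, again in Python as a simplified model rather than the PR's actual Scala; the class and method names are hypothetical:

```python
class ConcurrentBatchTracker:
    """Tracks started batches so metadata is cleared only up to the
    oldest completed batch, even with multiple jobs running at once."""

    def __init__(self):
        self.pending = []        # batch times started, in start order
        self.completed = set()   # completed batches not yet past the pointer
        self.clear_upto = None   # newest batch time safe to clear

    def on_batch_started(self, batch_time):
        self.pending.append(batch_time)

    def on_batch_completed(self, batch_time):
        self.completed.add(batch_time)
        # Advance the pointer past the oldest contiguously completed
        # batches, removing them from tracking (the "iterator stuff").
        while self.pending and self.pending[0] in self.completed:
            self.clear_upto = self.pending.pop(0)
            self.completed.discard(self.clear_upto)

    def may_clear(self, checkpoint_time):
        # A post-completion checkpoint may clear metadata only for the
        # oldest completed prefix; newer completed batches must wait.
        return self.clear_upto is not None and checkpoint_time <= self.clear_upto
```

So if batches 1, 2, 3 start concurrently and batch 2 finishes first, nothing may be cleared; once batch 1 finishes, the pointer advances through both 1 and 2.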