Github user harishreedharan commented on the pull request:
https://github.com/apache/spark/pull/5008#issuecomment-78852368
@tdas - Actually, this fixes one part of the problem, which is caused by
the checkpoint being started at the time the job is generated.
But an issue can still occur if `concurrentJobs` is set pretty high. With
that parameter high enough, a batch that started at time
`t + maxRememberDuration` might end up completing and checkpointing
before a batch at time `t`, if the batch at time `t` takes longer to
process.
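To illustrate the race (a hypothetical sketch, not Spark code; the names and the naive cleanup rule here are my assumptions for demonstration only):

```python
# With spark.streaming.concurrentJobs > 1, batches can finish out of order.
batch_interval = 1
max_remember_duration = 3

# Batches are generated at t = 0, 1, 2, 3 and run concurrently. The batch
# at t=0 is slow, so the batch at t=3 (= 0 + maxRememberDuration) completes
# and checkpoints first.
completion_order = [3, 1, 2, 0]

# Naive cleanup on completion: on each completed batch, data older than
# (completed batch time - maxRememberDuration) becomes eligible for deletion.
cleanup_cutoffs = [t - max_remember_duration for t in completion_order]

# The very first completion (t=3) already pushes the cutoff to t=0, even
# though the batch at t=0 is still running and may need its checkpoint.
print(cleanup_cutoffs[0])  # prints 0
```

The point is that the cleanup threshold is driven by *completion* events, so a fast late batch can advance it past a slow early batch that is still in flight.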
I have seen people set `concurrentJobs` pretty high when the cluster is
large and the processing order does not really matter. #4964 actually takes
care of that situation (which is what the maps are for). If that is not
required, there is an even simpler fix than this one, which was in an
earlier commit in that PR:
https://github.com/harishreedharan/spark/commit/fa93b871ba0fe22924ff0273e975e492a6a7043c
Simply keep track of the last completed batch and delete a checkpoint only
when its checkpoint time matches the time of the last completed batch.
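A minimal sketch of that simpler approach (hypothetical class and method names, not Spark's actual internals):

```python
class CheckpointTracker:
    """Sketch only: gate checkpoint deletion on the last completed batch."""

    def __init__(self):
        # Time of the batch most recently reported as completed.
        self.last_completed_time = None

    def mark_completed(self, batch_time):
        # Record the batch that just finished, regardless of order.
        self.last_completed_time = batch_time

    def should_delete(self, checkpoint_time):
        # Delete a checkpoint only when its time matches the last
        # completed batch, so a slow in-flight batch at an earlier
        # time cannot have its checkpoint removed out from under it.
        return checkpoint_time == self.last_completed_time
```

For example, if the batch at `t=3` completes first, `should_delete(0)` stays false until the batch at `t=0` itself completes.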