Github user gyfora commented on the pull request:
https://github.com/apache/flink/pull/1305#issuecomment-153470424
After some initial discussion we @StephanEwen we came to the following
conclusions:
The current timestamp based approach has some limitations in terms of the
assumptions it makes about the clocks on the checkpoint coordinator which can
cause some issues in case of master failure. It is currently assumed that the
recovery timestamp assigned by the checkpoint coordinator is larger than the
last checkpoint timestamp assigned before failure. If the clock difference is
large enough this might be a problem, although fairly unlikely.
The alternative approach would be to store the id, key, value triplets in
the database (drop the timestamps). Snapshots written between two snapshots
would be written with an incremented checkpoint id (and multiple spills to the
same key would update the value not insert a new one every time). Cleanup could
then be performed by getting the next checkpoint id from the coordinator (which
would get it from zookeper). The drawback in this case would be an increased
complexity for the batch inserts as we need to be able to handle primary key
conflicts efficiently. While this is easy on some databases (in MySql) in
other cases (Derby) can be problematic.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---