gaoyunhaii opened a new pull request #25: URL: https://github.com/apache/flink-ml/pull/25
We ran into some further issues in supporting checkpoints with iteration:

1. Support snapshotting the Replayer operator and the per-round wrapper.
2. Fix the issue where, after the head tasks have finished, the coordinator continues emitting CoordinatorCheckpointEvent, which makes the checkpoint fail because the events cannot be sent.
3. Support raw operator state inside the iteration.

Regarding the second point: currently the HeadCoordinator emits CoordinatorCheckpointEvent to the tasks so that the GloballyAlignedEvent is not interleaved with the checkpoint barrier. However, if the tasks have finished and we continue emitting the event, the checkpoint fails because sending the operator events fails. To address this issue, we stop emitting CoordinatorCheckpointEvent once the head operator is terminating, namely once it has received the GloballyAlignedEvent that marks termination.

The third point is required because operators like `withBroadcast` might rely on raw operator state to snapshot the cached records. This is necessary because normal operator state always resides in memory, which might not be large enough.
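
The fix for the second point can be illustrated with a minimal, self-contained sketch. It is not the flink-ml implementation: `HeadCoordinatorSketch`, `EventSender`, `onGloballyAligned`, and `onCheckpoint` are hypothetical names, events are modeled as plain strings, and the sketch assumes the coordinator itself learns about termination when it sends out the terminating GloballyAlignedEvent.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the head coordinator; NOT the actual flink-ml class. */
public class HeadCoordinatorSketch {

    /** Hypothetical channel to the head tasks (assumption, not a Flink interface). */
    interface EventSender {
        void send(String event);
    }

    private final EventSender sender;

    // Set once the GloballyAlignedEvent that marks termination has been sent.
    private boolean terminating = false;

    HeadCoordinatorSketch(EventSender sender) {
        this.sender = sender;
    }

    /** Notifies the head tasks that a round is globally aligned. */
    void onGloballyAligned(int round, boolean terminated) {
        sender.send("GloballyAlignedEvent(round=" + round + ", terminated=" + terminated + ")");
        if (terminated) {
            terminating = true;
        }
    }

    /**
     * Called when a checkpoint is triggered on the coordinator.
     * Returns true if a CoordinatorCheckpointEvent was emitted.
     */
    boolean onCheckpoint(long checkpointId) {
        if (terminating) {
            // The head tasks may already have finished; sending now would fail
            // and take the checkpoint down with it, so skip the event.
            return false;
        }
        sender.send("CoordinatorCheckpointEvent(" + checkpointId + ")");
        return true;
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        HeadCoordinatorSketch coordinator = new HeadCoordinatorSketch(sent::add);

        coordinator.onCheckpoint(1);            // emitted
        coordinator.onGloballyAligned(0, false);
        coordinator.onCheckpoint(2);            // emitted
        coordinator.onGloballyAligned(1, true); // marks termination
        coordinator.onCheckpoint(3);            // skipped

        sent.forEach(System.out::println);
    }
}
```

In this sketch the guard lives in the checkpoint path rather than in the sender, so the checkpoint itself can still complete after termination; only the extra event is suppressed.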