gaoyunhaii opened a new pull request #25:
URL: https://github.com/apache/flink-ml/pull/25


   We ran into several more issues while supporting checkpoints with iteration. 
This PR addresses them:
   
   1. Support snapshotting the Replayer operator and the per-round wrapper.
   2. Fix the issue that, after the head tasks have finished, the coordinator 
keeps emitting CoordinatorCheckpointEvent, which makes the checkpoint fail 
because the events cannot be sent.
   3. Support raw operator state inside the iteration.
   
   Regarding the second point: currently the HeadCoordinator emits 
CoordinatorCheckpointEvent to the tasks so that the GloballyAlignedEvent is 
not interleaved with the checkpoint barrier. However, if the tasks have 
finished and we continue emitting the event, the checkpoint fails because 
sending the operator events fails. To address this issue, we stop emitting 
CoordinatorCheckpointEvent once the head operator is terminating, namely once 
it has received the GloballyAlignedEvent that marks termination.
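   
   The guard described above can be sketched with a toy model. This is not 
the flink-ml coordinator code: the class `HeadCoordinatorSketch`, its method 
names, and the string-based event log are hypothetical and only illustrate 
the idea of skipping CoordinatorCheckpointEvent after the terminating 
GloballyAlignedEvent has been sent.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the guard; hypothetical, not the actual flink-ml class. */
public class HeadCoordinatorSketch {
    // Events "sent" to the head tasks, recorded as strings for illustration.
    private final List<String> sentEvents = new ArrayList<>();
    // Set once the terminating GloballyAlignedEvent has been emitted.
    private boolean terminating = false;

    /** Called when a round is globally aligned; `terminated` marks the last round. */
    public void onGloballyAligned(int round, boolean terminated) {
        sentEvents.add("GloballyAlignedEvent(round=" + round
                + ", terminated=" + terminated + ")");
        if (terminated) {
            terminating = true;
        }
    }

    /** Called when a checkpoint is triggered on the coordinator. */
    public void onCheckpoint(long checkpointId) {
        if (terminating) {
            // The head tasks may already have finished, so sending would fail
            // and take the checkpoint down with it. Skip the event instead.
            return;
        }
        sentEvents.add("CoordinatorCheckpointEvent(" + checkpointId + ")");
    }

    public List<String> sentEvents() {
        return sentEvents;
    }

    public static void main(String[] args) {
        HeadCoordinatorSketch coordinator = new HeadCoordinatorSketch();
        coordinator.onCheckpoint(1L);            // emitted
        coordinator.onGloballyAligned(0, true);  // marks termination
        coordinator.onCheckpoint(2L);            // skipped
        System.out.println(coordinator.sentEvents());
    }
}
```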
   
   The third point is required because operators like `withBroadcast` might 
rely on raw operator state to snapshot the cached records. This is necessary 
because normal operator state always resides in memory, which might not be 
sufficient for large caches.
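
   As a rough illustration of why a stream-based raw state helps, the sketch 
below writes cached records to an output stream one at a time rather than 
collecting them in an in-memory state object. It is an assumption-laden toy: 
`RawStateSketch` and its methods are hypothetical, records are plain strings, 
and a `ByteArrayOutputStream` stands in for the stream that Flink would hand 
out through `StateSnapshotContext#getRawOperatorStateOutput()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of streaming cached records into a raw state stream; hypothetical. */
public class RawStateSketch {

    /** Writes each cached record straight to the snapshot stream. */
    public static void snapshot(Iterable<String> cachedRecords, OutputStream rawStateOut)
            throws IOException {
        DataOutputStream out = new DataOutputStream(rawStateOut);
        for (String record : cachedRecords) {
            out.writeBoolean(true);   // marker: another record follows
            out.writeUTF(record);
        }
        out.writeBoolean(false);      // marker: end of records
        out.flush();
    }

    /** Reads the records back in the order they were written. */
    public static List<String> restore(InputStream rawStateIn) throws IOException {
        DataInputStream in = new DataInputStream(rawStateIn);
        List<String> records = new ArrayList<>();
        while (in.readBoolean()) {
            records.add(in.readUTF());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the raw operator state stream of a checkpoint.
        ByteArrayOutputStream checkpointStream = new ByteArrayOutputStream();
        snapshot(Arrays.asList("a", "b", "c"), checkpointStream);
        List<String> restored =
                restore(new ByteArrayInputStream(checkpointStream.toByteArray()));
        System.out.println(restored);
    }
}
```

   In a real operator the source of `cachedRecords` could itself be backed by 
files, so neither the cache nor the snapshot has to fit in the heap.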


