[jira] [Comment Edited] (FLINK-19300) Timer loss after restoring from savepoint

Tzu-Li (Gordon) Tai (Jira) Tue, 10 Nov 2020 23:10:56 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-19300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229796#comment-17229796
 ]


Tzu-Li (Gordon) Tai edited comment on FLINK-19300 at 11/11/20, 7:09 AM:
------------------------------------------------------------------------

Just a comment on the severity of the issue:

It looks like timer loss is only possible if the key groups somehow contain a 
{{0}} at the very beginning of the stream. This seems to be the only case that 
would lead the {{InternalTimerServiceSerializationProxy}} to silently skip the 
rest of the reads, instead of failing the restore with an {{IOException}}.


was (Author: tzulitai):
Just a comment on the severity of the issue:

It looks like timer loss is only possible if somehow, the key groups contain a 
{{0}} at the very beginning of the stream. This seems to be the only possible 
case that would lead to the {{InternalTimerServiceSerializationProxy}} silently 
skipping the rest of the reads.

> Timer loss after restoring from savepoint
> -----------------------------------------
>
>                 Key: FLINK-19300
>                 URL: https://issues.apache.org/jira/browse/FLINK-19300
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>            Reporter: Xiang Gao
>            Priority: Critical
>
> While using heap-based timers, we are seeing occasional timer loss after 
> restoring the program from a savepoint, especially when using remote 
> savepoint storage (S3). 
> After some investigation, the issue seems to be related to [this line in 
> deserialization|https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/io/PostVersionedIOReadableWritable.java#L65].
>  When checking the VERSIONED_IDENTIFIER, the input stream is not guaranteed 
> to fill the byte array, causing timers to be dropped for the affected key 
> group.
> The read should continue until the expected number of bytes has actually 
> been read or the end of the stream has been reached. 
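
The read loop suggested in the description could look roughly like the sketch
below. It is not the actual Flink fix: the class name, the helper
readUntilFullOrEof, the TrickleStream test stream and the marker bytes are all
hypothetical and only illustrate that a single InputStream.read(byte[]) call
may return fewer bytes than requested, while a loop keeps reading until the
buffer is full or the stream ends.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullySketch {

    /**
     * Hypothetical helper: keeps reading until buf is full or the end of the
     * stream is reached. Returns the number of bytes actually read.
     */
    static int readUntilFullOrEof(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break; // end of stream before the buffer was filled
            }
            total += n;
        }
        return total;
    }

    /**
     * Hypothetical stream that hands out at most one byte per bulk read,
     * standing in for a remote stream that returns short reads.
     */
    static final class TrickleStream extends InputStream {
        private final InputStream delegate;

        TrickleStream(InputStream delegate) {
            this.delegate = delegate;
        }

        @Override
        public int read() throws IOException {
            return delegate.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            return delegate.read(b, off, 1);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes, not the real VERSIONED_IDENTIFIER value.
        byte[] data = {10, 20, 30, 40, 1, 2};

        byte[] single = new byte[4];
        int viaSingleRead = new TrickleStream(new ByteArrayInputStream(data)).read(single);

        byte[] looped = new byte[4];
        int viaLoop = readUntilFullOrEof(new TrickleStream(new ByteArrayInputStream(data)), looped);

        System.out.println("single read returned " + viaSingleRead + " byte(s)");
        System.out.println("loop returned " + viaLoop + " byte(s)");
    }
}
```

A caller comparing the buffer against an identifier would also need to check
the returned count, so that a stream ending early is reported rather than
treated as a mismatch.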



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
