[jira] [Commented] (FLINK-20654) Unaligned checkpoint recovery may lead to corrupted data stream

Arvid Heise (Jira) Fri, 26 Mar 2021 06:03:05 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-20654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309412#comment-17309412
 ]


Arvid Heise commented on FLINK-20654:
-------------------------------------

Merged another set of test tunings and improved logging into master as 
f0d5d3be89c9762fdf147077a90831e365cf6ba0..913ea8e398e8396d044c14a0911d8e134ec4377d.
 I'm closing this ticket, as there have been no other failures in the past two 
weeks. If another failure occurs, the new logging should help us narrow it 
down.

> Unaligned checkpoint recovery may lead to corrupted data stream
> ---------------------------------------------------------------
>
>                 Key: FLINK-20654
>                 URL: https://issues.apache.org/jira/browse/FLINK-20654
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.0, 1.12.1
>            Reporter: Arvid Heise
>            Assignee: Piotr Nowojski
>            Priority: Blocker
>              Labels: pull-request-available, test-stability
>             Fix For: 1.12.2, 1.13.0
>
>
> The fix for FLINK-20433 revealed potential corruption after recovery in all 
> variations of UnalignedCheckpointITCase.
> To reproduce, run UnalignedCheckpointITCase (UCITCase) a couple hundred 
> times. The issue showed up for me in:
> - execute [Parallel union, p = 5]
> - execute [Parallel union, p = 10]
> - execute [Parallel cogroup, p = 5]
> - execute [parallel pipeline with remote channels, p = 5]
> (listed in order of decreasing frequency).
> The issue manifests as one of the following:
> - stream corrupted exception
> - EOF exception
> - assertion failure in NUM_LOST or NUM_OUT_OF_ORDER
> - (for union) ArithmeticException overflow (because a number that should be 
> in [0;100000] has been mis-deserialized)
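
The reproduction step quoted above (run the test a couple hundred times and watch for sporadic failures) could be scripted roughly as follows. This is only a minimal sketch, not part of the ticket: the helper name count_failures is hypothetical, and the actual test command is left to the caller because the ticket does not state how the test was invoked.

```python
import subprocess
import sys


def count_failures(command, runs):
    """Run `command` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for attempt in range(1, runs + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failures += 1
            # Print the tail of the output so the failure mode
            # (corrupted stream, EOF, assertion, overflow) can be classified.
            print(f"run {attempt} failed:\n{result.stdout[-2000:]}", file=sys.stderr)
    return failures


# Assumed usage: python repeat_test.py <runs> <test command ...>
if __name__ == "__main__" and len(sys.argv) > 2:
    total_runs = int(sys.argv[1])
    failed = count_failures(sys.argv[2:], total_runs)
    print(f"{failed} of {total_runs} runs failed")
```

Because the failures are sporadic, counting non-zero exits over many runs gives a rough failure rate that can be compared before and after a fix.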



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
