[jira] [Commented] (FLINK-21080) Identify JobVertex containing legacy source operators and abort checkpoint with legacy source operators partially finished

Piotr Nowojski (Jira) Wed, 21 Jul 2021 06:53:07 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-21080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384910#comment-17384910
 ]


Piotr Nowojski commented on FLINK-21080:
----------------------------------------

{quote}
I think this issue indeed exists: with union list state it may rely on a single 
task to write the value to implement a broadcast state pattern, but if this 
task is finished, this piece of information is lost.
{quote}
A note. {{BroadcastState}} I think is something else compared to the 
{{UnionListState}}, is it? Also with {{UnionListState}} it doesn't necessary 
must only single {{subtaskId == 0}} subtask writing/updating the state. There 
might be cases were there are many writers.
{quote}
It seems that the probability of we met this case (an operator use union list 
state and part of subtask is finished, and we just take a checkpoint) would be 
not too high
{quote}
{{FlinkKafkaProducer}} is using {{UnionListState}}. So it wouldn't be as un 
common as you think? 

Let's see if maybe others will come with a different proposals. Also it would 
be nice to estimate complexity of supporting finished subtasks with 
{{UnionListState}}, but I would lean towards an easy stop-gap solution 
(aborting checkpoints if {{UnionListState}} is detected with finished subtasks) 
to unblock this FLIP, and maybe improve it incrementally.

> Identify JobVertex containing legacy source operators and abort checkpoint 
> with legacy source operators partially finished
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-21080
>                 URL: https://issues.apache.org/jira/browse/FLINK-21080
>             Project: Flink
>          Issue Type: Sub-task
>          Components: API / DataStream, Runtime / Checkpointing
>            Reporter: Yun Gao
>            Assignee: Yun Gao
>            Priority: Major
>              Labels: auto-unassigned
>
> Most legacy source operators would record the offset for each partitions, and 
> after recovery it would read from the recorded offset. If before a checkpoint 
> some subtasks are finished, the corresponding partition offsets would be 
> deserted in the checkpoint. Then if the job recover with this checkpoint, the 
> legacy source would re-discovery all the partitions and for those finished 
> tasks, the legacy source would re-read them since their offsets are not 
> recorded. 
> Therefore, we would like to fail the checkpoint if some legacy source 
> operators have part of subtasks finished. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-21080) Identify JobVertex containing legacy source operators and abort checkpoint with legacy source operators partially finished

Reply via email to