[ https://issues.apache.org/jira/browse/FLINK-24667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Khachatryan resolved FLINK-24667.
---------------------------------------
    Resolution: Fixed

Merged as
61b217f72369c4ff4ff3e3d1bb16232bb07bbc7f to 1.13,
2ba57e29b8c7cc7d48f5313f0aeb96960c0796f6 to 1.14,
d1997b827a0e21308c57450dd7a6df1e8efa5bce to master.

> Channel state writer would fail the task directly if meeting exception 
> previously
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-24667
>                 URL: https://issues.apache.org/jira/browse/FLINK-24667
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Task
>    Affects Versions: 1.14.0, 1.13.3
>            Reporter: Yun Tang
>            Assignee: Roman Khachatryan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0, 1.14.1, 1.13.4
>
>
> Currently, if the channel state writer encounters an exception when closing a 
> file, for example during 
> {{SubtaskCheckpointCoordinatorImpl#cancelAsyncCheckpointRunnable}}, it exits 
> the loop. However, the subsequent {{channelStateWriter#abort}} call then 
> throws the exception directly:
> {code:java}
> switched from RUNNING to FAILED with failure cause: java.io.IOException: java.lang.RuntimeException: unable to send request to worker
>       at org.apache.flink.runtime.io.network.partition.consumer.InputChannel.checkError(InputChannel.java:228)
>       at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.checkPartitionRequestQueueInitialized(RemoteInputChannel.java:735)
>       at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.getNextBuffer(RemoteInputChannel.java:204)
>       at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.waitAndGetNextData(SingleInputGate.java:651)
>       at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:626)
>       at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNext(SingleInputGate.java:612)
>       at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.pollNext(InputGateWithMetrics.java:109)
>       at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:149)
>       at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:110)
>       at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
>       at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:98)
>       at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:424)
>       at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
>       at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:685)
>       at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:640)
>       at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:651)
>       at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:624)
>       at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:798)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
> {code}
> This is not expected, because a checkpoint failure should not lead to a task 
> failover every time.
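
The failure mode in the description can be illustrated with a small, self-contained sketch. It is not Flink's code: ChannelWriterSketch, Worker, abortStrict and abortLenient are hypothetical names, and the lenient variant is only one possible way to keep an abort from failing the task, not necessarily what the merged commits do. Only the message "unable to send request to worker" is taken from the stack trace in the report.

```java
import java.io.IOException;

public class ChannelWriterSketch {

    /** Hypothetical stand-in for the writer's background worker; not a Flink class. */
    static class Worker {
        private Exception failure; // set once the worker loop has exited on an error

        // Simulates the worker loop exiting, e.g. after a failed file close.
        void failWith(Exception e) {
            this.failure = e;
        }

        // Rejects every new request once the loop has exited.
        void submit(String request) throws Exception {
            if (failure != null) {
                throw new IllegalStateException("worker not running", failure);
            }
        }
    }

    // Problematic pattern: abort rethrows, so the error reaches the calling task thread.
    static void abortStrict(Worker worker, long checkpointId) {
        try {
            worker.submit("abort " + checkpointId);
        } catch (Exception e) {
            throw new RuntimeException("unable to send request to worker", e);
        }
    }

    // Tolerant variant (assumption): abort is best effort and reports instead of throwing.
    static boolean abortLenient(Worker worker, long checkpointId) {
        try {
            worker.submit("abort " + checkpointId);
            return true;
        } catch (Exception e) {
            return false; // the abort request was not delivered; the caller keeps running
        }
    }

    public static void main(String[] args) {
        Worker worker = new Worker();
        worker.failWith(new IOException("close failed"));

        System.out.println("lenient abort delivered: " + abortLenient(worker, 1L));
        try {
            abortStrict(worker, 1L);
        } catch (RuntimeException e) {
            System.out.println("strict abort threw: " + e.getMessage());
        }
    }
}
```

In the strict variant, an earlier write-side error resurfaces during abort and propagates to the caller, which corresponds to the task switching from RUNNING to FAILED in the report. The lenient variant confines the error to the checkpoint being aborted.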



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
