[ 
https://issues.apache.org/jira/browse/FLINK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500931#comment-17500931
 ] 

Matthias Pohl edited comment on FLINK-26450 at 3/3/22, 5:40 PM:
----------------------------------------------------------------

Tests become flaky due to this change, e.g. [this 
build|https://dev.azure.com/mapohl/flink/_build/results?buildId=808&view=results]
{code}
2022-03-03 14:30:11,282 WARN  org.apache.flink.runtime.checkpoint.OperatorSubtaskState     [] - Error while discarding operator states.
java.io.IOException: /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-47072687872/savepoint-e2e-test-chckpt-dir/b570100734a17ad72d8d2ccc712f681d/chk-11/73833c1e-bc28-4d68-8752-496d0ea65e8b could not be deleted for unknown reasons.
        at org.apache.flink.runtime.state.filesystem.FileStateHandle.discardState(FileStateHandle.java:86) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at org.apache.flink.runtime.state.KeyGroupsStateHandle.discardState(KeyGroupsStateHandle.java:125) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at org.apache.flink.util.LambdaUtil.applyToAllWhileSuppressingExceptions(LambdaUtil.java:55) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at org.apache.flink.runtime.state.StateUtil.bestEffortDiscardAllStateObjects(StateUtil.java:62) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at org.apache.flink.runtime.checkpoint.OperatorSubtaskState.discardState(OperatorSubtaskState.java:211) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at org.apache.flink.util.LambdaUtil.applyToAllWhileSuppressingExceptions(LambdaUtil.java:55) [flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at org.apache.flink.runtime.state.StateUtil.bestEffortDiscardAllStateObjects(StateUtil.java:62) [flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at org.apache.flink.runtime.checkpoint.TaskStateSnapshot.discardState(TaskStateSnapshot.java:156) [flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$1.run(CheckpointCoordinator.java:2007) [flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_322]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_322]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
{code}

The error is logged in 
[CheckpointCoordinator:2009|https://github.com/apache/flink/blob/d91cb003221d65e07e135d510ff897f7520add6f/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L2009]



> FileStateHandle.discardState does not process return value
> ----------------------------------------------------------
>
>                 Key: FLINK-26450
>                 URL: https://issues.apache.org/jira/browse/FLINK-26450
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / FileSystem, Runtime / Coordination
>    Affects Versions: 1.15.0, 1.13.6, 1.14.3
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> The retryable cleanup does not work properly if an error occurs during 
> the {{FileSystem.delete}} call that is used within 
> [FileStateHandle.discardState|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FileStateHandle.java#L85].
>  Some {{FileSystem}} implementations (e.g. S3 presto; see 
> [PrestoS3FileSystem:512|https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java#L512]
>  through [PrestoS3FileSystem.delete(Path, 
> boolean)|https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java#L480])
>  return {{false}} in case of an error. {{FileStateHandle.discardState}} 
> does not check that return value, so the error is swallowed.
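
The difference between ignoring and checking the return value of a delete call can be sketched as follows. This is a minimal, self-contained illustration and not the actual Flink code: {{SimpleFileSystem}}, {{DiscardSketch}}, and both method names are hypothetical stand-ins, and only the exception message is borrowed from the log output quoted in the comment.

```java
import java.io.IOException;

/** Hypothetical stand-in for a file system whose delete() reports failure via its return value. */
interface SimpleFileSystem {
    boolean delete(String path, boolean recursive) throws IOException;
}

/** Hypothetical helper contrasting the two ways of handling the delete() result. */
final class DiscardSketch {
    private DiscardSketch() {}

    /** Ignores the return value, so a failed deletion goes unnoticed by the caller. */
    static void discardIgnoringResult(SimpleFileSystem fs, String path) throws IOException {
        fs.delete(path, false);
    }

    /** Checks the return value and surfaces a failed deletion as an IOException. */
    static void discardCheckingResult(SimpleFileSystem fs, String path) throws IOException {
        if (!fs.delete(path, false)) {
            throw new IOException(path + " could not be deleted for unknown reasons.");
        }
    }

    public static void main(String[] args) throws IOException {
        // A file system that always signals failure by returning false.
        SimpleFileSystem alwaysFails = (path, recursive) -> false;

        // No exception here: the caller cannot tell that nothing was deleted.
        discardIgnoringResult(alwaysFails, "/tmp/chk-1/state");
        System.out.println("ignoring variant: no error surfaced");

        // The checking variant turns the false return value into an exception.
        try {
            discardCheckingResult(alwaysFails, "/tmp/chk-1/state");
        } catch (IOException e) {
            System.out.println("checking variant: " + e.getMessage());
        }
    }
}
```

Only the checking variant gives a retry mechanism something to react to. A real fix would also have to decide how to treat a {{false}} return for a file that is already gone, which this sketch does not address.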



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
