[ https://issues.apache.org/jira/browse/FLINK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500931#comment-17500931 ]
Matthias Pohl edited comment on FLINK-26450 at 3/3/22, 5:40 PM:
----------------------------------------------------------------
Tests become flaky due to this change, e.g. [this build|https://dev.azure.com/mapohl/flink/_build/results?buildId=808&view=results]:
{code}
2022-03-03 14:30:11,282 WARN org.apache.flink.runtime.checkpoint.OperatorSubtaskState [] - Error while discarding operator states.
java.io.IOException: /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-47072687872/savepoint-e2e-test-chckpt-dir/b570100734a17ad72d8d2ccc712f681d/chk-11/73833c1e-bc28-4d68-8752-496d0ea65e8b could not be deleted for unknown reasons.
    at org.apache.flink.runtime.state.filesystem.FileStateHandle.discardState(FileStateHandle.java:86) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.state.KeyGroupsStateHandle.discardState(KeyGroupsStateHandle.java:125) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.util.LambdaUtil.applyToAllWhileSuppressingExceptions(LambdaUtil.java:55) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.state.StateUtil.bestEffortDiscardAllStateObjects(StateUtil.java:62) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.checkpoint.OperatorSubtaskState.discardState(OperatorSubtaskState.java:211) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.util.LambdaUtil.applyToAllWhileSuppressingExceptions(LambdaUtil.java:55) [flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.state.StateUtil.bestEffortDiscardAllStateObjects(StateUtil.java:62) [flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.checkpoint.TaskStateSnapshot.discardState(TaskStateSnapshot.java:156) [flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$1.run(CheckpointCoordinator.java:2007) [flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_322]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_322]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
{code}
The error is logged in
[CheckpointCoordinator:2009|https://github.com/apache/flink/blob/d91cb003221d65e07e135d510ff897f7520add6f/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L2009].
> FileStateHandle.discardState does not process return value
> ----------------------------------------------------------
>
> Key: FLINK-26450
> URL: https://issues.apache.org/jira/browse/FLINK-26450
> Project: Flink
> Issue Type: Bug
> Components: Connectors / FileSystem, Runtime / Coordination
> Affects Versions: 1.15.0, 1.13.6, 1.14.3
> Reporter: Matthias Pohl
> Assignee: Matthias Pohl
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.15.0
>
>
> The retryable cleanup does not work properly if an error occurs during the
> {{FileSystem.delete}} call that is made in
> [FileStateHandle.discardState|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FileStateHandle.java#L85].
> Some {{FileSystem}} implementations (e.g. S3 presto; see
> [PrestoS3FileSystem:512|https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java#L512]
> through [PrestoS3FileSystem.delete(Path,
> boolean)|https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java#L480])
> return {{false}} in case of an error. {{FileStateHandle.discardState}} does
> not check that return value, so the error is swallowed and the cleanup is
> never retried.
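
The problem described above can be illustrated with a minimal sketch. It is not the actual Flink code or the fix from the pull request: {{MiniFileSystem}}, {{DiscardSketch}} and {{discard}} are hypothetical names, and the interface only mimics the shape of a {{delete(path, recursive)}} call that reports failure through its return value. The idea is to turn a {{false}} result into an {{IOException}} when the file is still present, so that a retrying caller gets to see the failure.

```java
import java.io.IOException;

public class DiscardSketch {

    /** Hypothetical stand-in for the two file system calls involved. */
    public interface MiniFileSystem {
        /** Returns false (instead of throwing) if the deletion failed. */
        boolean delete(String path, boolean recursive) throws IOException;

        boolean exists(String path) throws IOException;
    }

    /**
     * Deletes the file and surfaces a failed deletion as an exception.
     * A false result is only treated as an error if the file still exists;
     * a file that was already gone does not need to be retried.
     */
    public static void discard(MiniFileSystem fs, String path) throws IOException {
        boolean deleted = fs.delete(path, false);
        if (!deleted && fs.exists(path)) {
            throw new IOException(path + " could not be deleted for unknown reasons.");
        }
    }

    public static void main(String[] args) {
        // A file system whose delete call always fails silently.
        MiniFileSystem failing = new MiniFileSystem() {
            @Override
            public boolean delete(String path, boolean recursive) {
                return false;
            }

            @Override
            public boolean exists(String path) {
                return true;
            }
        };

        try {
            discard(failing, "chk-11/state-file");
        } catch (IOException e) {
            // The caller can now log the failure or schedule a retry.
            System.out.println("discard failed: " + e.getMessage());
        }
    }
}
```

Checking {{exists}} before throwing is a design choice of this sketch: it keeps the operation idempotent, because a second attempt on a file that has already been removed succeeds instead of failing again.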
--
This message was sent by Atlassian Jira
(v8.20.1#820001)