[GitHub] [flink] XComp edited a comment on pull request #18963: [FLINK-26450][fs] Adds error handling to FileStateHandle.discardState

GitBox Thu, 03 Mar 2022 10:57:45 -0800


XComp edited a comment on pull request #18963:
URL: https://github.com/apache/flink/pull/18963#issuecomment-1058378672



   I had to revisit the issue because I noticed that the contract of the 
`FileSystem.delete` method does not say what happens when the underlying file 
doesn't exist. `LocalFileSystem` implements delete on top of 
`java.io.File.delete`, so it returns `false` whenever it didn't delete the 
file, which includes the case where the file was already missing.
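
A minimal sketch of that ambiguity, using plain `java.io.File` rather than 
Flink's `FileSystem` classes. The class `DeleteSemanticsSketch` and the helper 
`discardTolerantly` are hypothetical names for illustration and are not part of 
this PR. Calling `File.delete` a second time on the same path returns `false` 
because the file is already gone, so the return value alone cannot tell a real 
failure apart from a missing file.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class DeleteSemanticsSketch {

    /**
     * Hypothetical helper (not Flink code): deletes the file and treats an
     * already-missing file as success instead of as an error.
     */
    static void discardTolerantly(File file) throws IOException {
        // File.delete() returns false both for a real failure and for a file
        // that does not exist, so check existence before reporting an error.
        if (!file.delete() && file.exists()) {
            throw new IOException(file + " could not be deleted.");
        }
    }

    public static void main(String[] args) throws IOException {
        File file = Files.createTempFile("state-handle", ".tmp").toFile();

        boolean first = file.delete();  // removes the existing file
        boolean second = file.delete(); // nothing left to delete
        System.out.println("first=" + first + ", second=" + second);

        // Does not throw although the file no longer exists.
        discardTolerantly(file);
    }
}
```

For local paths, `java.nio.file.Files.deleteIfExists` draws the same line: it 
returns `false` for a missing file and throws an `IOException` on an actual 
failure. Whether that is an option here depends on the `FileSystem` 
implementation in use.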
   
   This was probably why the e2e tests failed in [this build](https://dev.azure.com/mapohl/flink/_build/results?buildId=808&view=logs&j=d63a5fc4-24ea-51df-9ade-fa4330af161c&t=977479f1-49ea-5c4c-884c-4646ed1443ab):
   ```
   2022-03-03 14:30:11,092 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 11 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1646317811091 for job b570100734a17ad72d8d2ccc712f681d.
   2022-03-03 14:30:11,215 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Triggering stop-with-savepoint for job b570100734a17ad72d8d2ccc712f681d.
   2022-03-03 14:30:11,232 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 12 (type=SavepointType{name='Suspend Savepoint', postCheckpointAction=SUSPEND, formatType=CANONICAL}) @ 1646317811228 for job b570100734a17ad72d8d2ccc712f681d.
   2022-03-03 14:30:11,259 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 11 from task 275909f41c4e9d1635d1c3d3c1f55b4c of job b570100734a17ad72d8d2ccc712f681d at 127.0.0.1:34655-d7bf22 @ localhost (dataPort=37055).
   [...]
   2022-03-03 14:30:11,282 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 11 from task f827493a1120315cebf2c38987fb2709 of job b570100734a17ad72d8d2ccc712f681d at 127.0.0.1:34655-d7bf22 @ localhost (dataPort=37055).
   2022-03-03 14:30:11,288 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 11 from task bb54c8be2cceb115193c02f53ce3cf3e of job b570100734a17ad72d8d2ccc712f681d at 127.0.0.1:34655-d7bf22 @ localhost (dataPort=37055).
   2022-03-03 14:30:11,282 WARN  org.apache.flink.runtime.checkpoint.OperatorSubtaskState     [] - Error while discarding operator states.
   java.io.IOException: /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-47072687872/savepoint-e2e-test-chckpt-dir/b570100734a17ad72d8d2ccc712f681d/chk-11/73833c1e-bc28-4d68-8752-496d0ea65e8b could not be deleted for unknown reasons.
           at org.apache.flink.runtime.state.filesystem.FileStateHandle.discardState(FileStateHandle.java:86) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
           at org.apache.flink.runtime.state.KeyGroupsStateHandle.discardState(KeyGroupsStateHandle.java:125) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
           at org.apache.flink.util.LambdaUtil.applyToAllWhileSuppressingExceptions(LambdaUtil.java:55) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
           at org.apache.flink.runtime.state.StateUtil.bestEffortDiscardAllStateObjects(StateUtil.java:62) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
           at org.apache.flink.runtime.checkpoint.OperatorSubtaskState.discardState(OperatorSubtaskState.java:211) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
           at org.apache.flink.util.LambdaUtil.applyToAllWhileSuppressingExceptions(LambdaUtil.java:55) [flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
           at org.apache.flink.runtime.state.StateUtil.bestEffortDiscardAllStateObjects(StateUtil.java:62) [flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
           at org.apache.flink.runtime.checkpoint.TaskStateSnapshot.discardState(TaskStateSnapshot.java:156) [flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
           at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$1.run(CheckpointCoordinator.java:2007) [flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_322]
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_322]
           at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
   ```
   
   @rkhachatryan Is it possible that state is not persisted to disk when a 
savepoint operation is running concurrently?


