Gyula Fora created FLINK-3431:
---------------------------------
Summary: Add retrying logic for RocksDB snapshots
Key: FLINK-3431
URL: https://issues.apache.org/jira/browse/FLINK-3431
Project: Flink
Issue Type: Improvement
Components: Streaming
Reporter: Gyula Fora
Priority: Critical
Currently the RocksDB snapshots rely on hdfs copy not failing while taking the
snapshots.
In some cases when the state size is big enough the HDFS nodes might get so
overloaded that the copy operation fails on errors like this:
AsynchronousException{java.io.IOException: All datanodes 172.26.86.90:50010 are
bad. Aborting...}
at
org.apache.flink.streaming.runtime.tasks.StreamTask$1.run(StreamTask.java:545)
Caused by: java.io.IOException: All datanodes 172.26.86.90:50010 are bad.
Aborting...
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1023)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:838)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:483)
I think it would be important that we don't immediately fail the job in these
cases but retry the copy operation after some random sleep time. It might be
also good to do a random sleep before the copy depending on the state size to
smoothen out IO a little bit.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)