[
https://issues.apache.org/jira/browse/FLINK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150442#comment-15150442
]
Gyula Fora commented on FLINK-3431:
-----------------------------------
I think we should do that too, but we should still retry the copy if that is
the actual issue. Hoping that it will not fail the next time might not work
out, and without retries you can end up not being able to take any snapshots
at all.
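A minimal sketch of what such retry logic could look like, assuming a small
Java helper wrapped around the existing copy step; the names below
(RetryingSnapshotCopy, CopyOperation, maxAttempts, maxInitialSleepMs,
maxBackoffMs) are illustrative, not existing Flink APIs:

    import java.io.IOException;
    import java.util.concurrent.ThreadLocalRandom;

    public final class RetryingSnapshotCopy {

        /** Minimal interface for the copy step so it can throw IOException. */
        public interface CopyOperation {
            void run() throws IOException;
        }

        /**
         * Runs the copy up to maxAttempts times (maxAttempts >= 1). Sleeps a
         * random amount before the first attempt, so parallel subtasks do not
         * all start copying at once, and a random backoff between attempts
         * instead of failing the job immediately.
         */
        public static void copyWithRetries(CopyOperation copy, int maxAttempts,
                                           long maxInitialSleepMs, long maxBackoffMs)
                throws IOException, InterruptedException {

            ThreadLocalRandom rnd = ThreadLocalRandom.current();

            // Random sleep before the copy; the caller could scale
            // maxInitialSleepMs with the state size to smooth out IO.
            Thread.sleep(rnd.nextLong(maxInitialSleepMs + 1));

            IOException lastFailure = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    copy.run();
                    return; // copy succeeded, snapshot can complete
                } catch (IOException e) {
                    lastFailure = e;
                    if (attempt < maxAttempts) {
                        // Random backoff so overloaded datanodes get a chance to recover.
                        Thread.sleep(rnd.nextLong(maxBackoffMs + 1));
                    }
                }
            }
            throw lastFailure;
        }
    }

This is only a sketch of the retry-with-random-backoff idea, not a proposal
for where exactly it would hook into the RocksDB state backend.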
> Add retrying logic for RocksDB snapshots
> ----------------------------------------
>
> Key: FLINK-3431
> URL: https://issues.apache.org/jira/browse/FLINK-3431
> Project: Flink
> Issue Type: Improvement
> Components: Streaming
> Reporter: Gyula Fora
> Priority: Critical
>
> Currently the RocksDB snapshots rely on the HDFS copy not failing while the
> snapshot is taken.
> In some cases, when the state size is big enough, the HDFS nodes might get so
> overloaded that the copy operation fails with errors like this:
> AsynchronousException{java.io.IOException: All datanodes 172.26.86.90:50010
> are bad. Aborting...}
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask$1.run(StreamTask.java:545)
> Caused by: java.io.IOException: All datanodes 172.26.86.90:50010 are bad.
> Aborting...
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1023)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:838)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:483)
> I think it would be important that we don't immediately fail the job in these
> cases but retry the copy operation after some random sleep time. It might also
> be good to do a random sleep before the copy, depending on the state size, to
> smooth out IO a little bit.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)