hfutatzhanghb commented on PR #7810: URL: https://github.com/apache/hadoop/pull/7810#issuecomment-3232146220
@Hexiaoqiao Thanks very much for reviewing. Please allow me to define the issue clearly here. Recently, while exploring the use of HDFS Erasure Coding (EC) for hot-data storage, we encountered several problems, and the current issue is one of them.

**Problem description (pseudo-code):**

```java
DFSStripedOutputStream os = dfs.create(path);
// The task may run for several hours, so the output-stream object is also held open for hours.
while (task is not finished) {
    data = doSomeComputeLogicAndGetData();
    os.write(data);
}
os.close();
```

When we perform a rolling restart of DataNodes, the above task fails. The root cause is that, during writing, an EC output stream excludes any bad DataNode from the pipeline, but there is no mechanism to add new DataNodes to replace the excluded ones. Once more than three DataNodes have been excluded, the output stream no longer has enough DataStreamers to continue writing and therefore aborts.

So this PR tries to resolve the problem by ending the block group in advance when failed streamers are encountered (while the failed-streamer count is still <= 3), so that a new block group can be allocated. After the new block group is allocated, we will again have sufficient healthy data streamers.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
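To make the tolerance arithmetic concrete, here is a minimal, self-contained sketch assuming the default RS(6,3) policy (6 data + 3 parity streamers per block group). The class `EcStreamerSim` and its methods are hypothetical; they only model the failure-counting logic described above, not Hadoop's actual `DFSStripedOutputStream` internals.

```java
// Hypothetical simulation of the streamer-exclusion logic for an RS(6,3) block group.
public class EcStreamerSim {
    static final int DATA_UNITS = 6;   // RS(6,3): 6 data units per block group
    static final int PARITY_UNITS = 3; // ...and 3 parity units, so up to 3 failures are tolerable

    int failedStreamers = 0;

    // A streamer's DataNode is excluded (e.g., it went down during a rolling restart).
    void excludeStreamer() {
        failedStreamers++;
    }

    // Current behavior: writing can continue only while the number of failed
    // streamers does not exceed the parity count; beyond that the stream aborts.
    boolean canContinueCurrentBlockGroup() {
        return failedStreamers <= PARITY_UNITS;
    }

    // Proposed behavior: while still within tolerance, end the block group early
    // and allocate a new one, which restores a full set of 9 healthy streamers.
    void endBlockGroupAndAllocateNew() {
        failedStreamers = 0; // new block group -> fresh streamers on healthy DataNodes
    }

    public static void main(String[] args) {
        EcStreamerSim sim = new EcStreamerSim();

        // A rolling restart takes out three DataNodes used by this block group.
        for (int i = 0; i < 3; i++) {
            sim.excludeStreamer();
        }
        System.out.println(sim.canContinueCurrentBlockGroup()); // true: within parity tolerance

        // Without the fix, a fourth exclusion aborts the stream.
        sim.excludeStreamer();
        System.out.println(sim.canContinueCurrentBlockGroup()); // false: more than 3 failures

        // With the fix applied before tolerance is exceeded, writing continues.
        sim.endBlockGroupAndAllocateNew();
        System.out.println(sim.canContinueCurrentBlockGroup()); // true: full set of streamers again
    }
}
```

The key point the sketch illustrates is that the failure count is per block group: allocating a new block group resets it, which is why ending the group early lets a long-running writer survive a rolling restart.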