[ https://issues.apache.org/jira/browse/HDFS-7097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140735#comment-14140735 ]
Hadoop QA commented on HDFS-7097:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12670009/HDFS-7097.patch
  against trunk revision bf27b9c.

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}.  The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs-nfs:

org.apache.hadoop.hdfs.nfs.nfs3.TTests
org.apache.hadoop.hdfs.nfs.nfs3.TestOpenFilTests

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8106//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8106//console

This message is automatically generated.

> Allow block reports to be processed during checkpointing on standby name node
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-7097
>                 URL: https://issues.apache.org/jira/browse/HDFS-7097
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>         Attachments: HDFS-7097.patch
>
>
> On a reasonably busy HDFS cluster, there is a stream of creates, causing data
> nodes to generate incremental block reports.
> When a standby name node is checkpointing, RPC handler threads trying to
> process a full or incremental block report are blocked on the name system's
> {{fsLock}}, because the checkpointer acquires the read lock on it. This can
> create a serious problem if the name space is large and checkpointing takes
> a long time.
> All available RPC handlers can be tied up very quickly. If you have 100
> handlers, it only takes 34 file creates. If a separate service RPC port is
> not used, an HA transition will have to wait in the call queue for minutes.
> Even if a separate service RPC port is configured, heartbeats from data nodes
> will be blocked. A standby NN with a large name space can lose all data nodes
> after checkpointing. The RPC calls will also be retransmitted by data nodes
> many times, filling up the call queue and potentially causing listen queue
> overflow.
> Since block reports are not modifying any state that is being saved to
> fsimage, I propose letting them through during checkpointing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
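
For readers following the thread, the blocking described in the issue can be reproduced in miniature. The sketch below is not NameNode code. It assumes that block report processing needs the write lock on {{fsLock}} (which is what makes the checkpointer's read lock a problem), and it reads the "34 file creates" figure as 100 handlers divided by three incremental reports per create, rounded up. The replication factor of 3 is an assumption, and the class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; none of these names exist in Hadoop.
class HandlerStarvationSketch {

    // Creates needed to occupy every handler, assuming each create produces
    // one incremental block report per replica (assumed replication factor).
    static int createsToExhaust(int handlers, int replication) {
        return (handlers + replication - 1) / replication; // ceiling division
    }

    // Holds the read lock (standing in for the checkpointer) while `reports`
    // tasks on a pool of `handlers` threads try to take the write lock.
    // Returns how many tasks gave up waiting for the write lock.
    static int blockedHandlers(int handlers, int reports) {
        ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);
        ExecutorService pool = Executors.newFixedThreadPool(handlers);
        AtomicInteger timedOut = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(reports);

        fsLock.readLock().lock(); // "checkpoint" in progress
        try {
            for (int i = 0; i < reports; i++) {
                pool.execute(() -> {
                    try {
                        // A block report waits for the write lock; the timeout
                        // exists only so that this demo terminates.
                        if (fsLock.writeLock().tryLock(200, TimeUnit.MILLISECONDS)) {
                            fsLock.writeLock().unlock();
                        } else {
                            timedOut.incrementAndGet();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        done.countDown();
                    }
                });
            }
            done.await(); // the read lock stays held until every task has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            fsLock.readLock().unlock();
            pool.shutdown();
        }
        return timedOut.get();
    }

    public static void main(String[] args) {
        System.out.println("creates to tie up 100 handlers: " + createsToExhaust(100, 3));
        System.out.println("handlers stuck during checkpoint: " + blockedHandlers(4, 4));
    }
}
```

In the real NameNode the handlers wait without a timeout, so they stay occupied for the full duration of the checkpoint. That is why heartbeats and HA transition requests queued behind them are delayed as well.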