[ https://issues.apache.org/jira/browse/HDFS-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150376#comment-14150376 ]
Hadoop QA commented on HDFS-7009:
---------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12671618/HDFS-7009.patch
against trunk revision 5f16c98.
{color:green}+1 @author{color}. The patch does not contain any @author
tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new
or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with
eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 1 new
Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in
hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results:
https://builds.apache.org/job/PreCommit-HDFS-Build/8235//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HDFS-Build/8235//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8235//console
This message is automatically generated.
> Active NN and standby NN have different live nodes
> --------------------------------------------------
>
> Key: HDFS-7009
> URL: https://issues.apache.org/jira/browse/HDFS-7009
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: HDFS-7009.patch
>
>
> To follow up on https://issues.apache.org/jira/browse/HDFS-6478: in most
> cases, given that the DN sends heartbeats (HB) and block reports (BR) to the
> NN regularly, the failure of a specific RPC call isn't a big deal.
> However, there are cases where the DN fails to register with the NN during
> the initial handshake due to exceptions not covered by the RPC client's
> connection retry. When this happens, the DN won't talk to that NN until the
> DN restarts.
> {noformat}
> BPServiceActor
>   public void run() {
>     LOG.info(this + " starting to offer service");
>     try {
>       // init stuff
>       try {
>         // setup storage
>         connectToNNAndHandshake();
>       } catch (IOException ioe) {
>         // Initial handshake, storage recovery or registration failed
>         // End BPOfferService thread
>         LOG.fatal("Initialization failed for block pool " + this, ioe);
>         return;
>       }
>       initialized = true; // bp is initialized;
>
>       while (shouldRun()) {
>         try {
>           offerService();
>         } catch (Exception ex) {
>           LOG.error("Exception in BPOfferService for " + this, ex);
>           sleepAndLogInterrupts(5000, "offering service");
>         }
>       }
>       ...
> {noformat}
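The early `return` in the inner catch block is what ends the actor thread for good. As a rough illustration only, here is a minimal, self-contained sketch of the alternative: retrying the handshake instead of giving up on the first `IOException`. The class `HandshakeRetrySketch`, the `Handshake` interface, and `connectWithRetry` are hypothetical names invented for this example; this is not the attached patch and not the real `BPServiceActor` code.

```java
import java.io.IOException;

/** Hypothetical stand-in for the actor thread; not the real BPServiceActor. */
public class HandshakeRetrySketch {

    /** Hypothetical abstraction of the initial handshake with one NN. */
    public interface Handshake {
        void connect() throws IOException;
    }

    /**
     * Retries the handshake instead of returning on the first IOException.
     * Returns the number of attempts it took, or -1 if the attempt budget
     * ran out or the thread was interrupted while sleeping.
     */
    public static int connectWithRetry(Handshake handshake, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handshake.connect();
                return attempt; // handshake succeeded
            } catch (IOException ioe) {
                System.err.println("Handshake attempt " + attempt + " failed: " + ioe.getMessage());
            }
            try {
                Thread.sleep(sleepMillis); // back off before the next attempt
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated NN that fails the first two registrations, then accepts.
        Handshake flaky = () -> {
            if (++calls[0] < 3) {
                throw new IOException("Response is null.");
            }
        };
        System.out.println("registered after " + connectWithRetry(flaky, 5, 10L) + " attempts");
        // prints "registered after 3 attempts"
    }
}
```

A bounded attempt count is used here only to keep the example terminating; whether a real DN should retry forever or stop under some condition is a design question this sketch does not answer.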
> Here is an example of the call stack.
> {noformat}
> java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "xxx"; destination host is: "yyy":8030;
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1239)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>     at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>     at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Response is null.
>     at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
> {noformat}
> This creates a discrepancy between the active NN and the standby NN in
> terms of live nodes.
>
> Here is a possible scenario that leads to missing blocks after failover.
> 1. DN A and B complete the handshake with the active NN, but not with the
> standby NN.
> 2. A block is replicated to DN A, B and C.
> 3. From the standby NN's point of view, A and B are dead nodes, so the block
> is under-replicated.
> 4. DN C goes down.
> 5. Before the active NN detects that DN C is down, it fails over.
> 6. The new active NN considers the block missing, even though two replicas
> still exist on DN A and B.
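The scenario above can be traced with a toy model. This is only a sketch under simplifying assumptions: `LiveNodeViewSketch` and `NameNodeView` are hypothetical names, each NN is reduced to the set of DNs it considers live, and the new active NN is assumed to mark DN C dead once it notices C is gone.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LiveNodeViewSketch {

    /** Hypothetical, heavily simplified NN state: just the DNs it considers live. */
    public static final class NameNodeView {
        private final Set<String> liveNodes = new HashSet<>();

        public void register(String dn) {
            liveNodes.add(dn);
        }

        public void markDead(String dn) {
            liveNodes.remove(dn);
        }

        /** Counts the replicas hosted on DNs that this NN considers live. */
        public int liveReplicas(List<String> replicaLocations) {
            int count = 0;
            for (String dn : replicaLocations) {
                if (liveNodes.contains(dn)) {
                    count++;
                }
            }
            return count;
        }
    }

    public static void main(String[] args) {
        NameNodeView active = new NameNodeView();
        NameNodeView standby = new NameNodeView();

        // Step 1: A and B register with the active NN only; C registers with both.
        active.register("A");
        active.register("B");
        active.register("C");
        standby.register("C");

        // Step 2: the block has replicas on A, B and C.
        List<String> block = Arrays.asList("A", "B", "C");

        // Step 3: the standby NN sees one live replica, so the block is under-replicated.
        System.out.println("standby live replicas: " + standby.liveReplicas(block)); // 1

        // Steps 4-6: C goes down, failover happens, and the new active NN
        // (the former standby) marks C dead. It now sees no live replica.
        standby.markDead("C");
        System.out.println("new active live replicas: " + standby.liveReplicas(block)); // 0

        // The old active NN had all three DNs registered and never saw a problem.
        System.out.println("old active live replicas: " + active.liveReplicas(block)); // 3
    }
}
```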
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)