[jira] [Comment Edited] (HDFS-12749) DN may not send block report to NN after NN restart

Erik Krogen (JIRA) Fri, 23 Feb 2018 11:10:24 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374858#comment-16374858
 ]


Erik Krogen edited comment on HDFS-12749 at 2/23/18 7:09 PM:
-------------------------------------------------------------

v003 patch LGTM! Thanks for pushing this [~hexiaoqiao] and sorry for the slow 
response.

I wanted to dig deeper to understand why the {{SocketTimeoutException}} was 
over-wrapped in an extra {{IOException}}, which was also a cause of this 
problem. As far as I can tell, it comes from this code block within 
{{Client.Connection#setupIOStreams}}:
{code}
            try {
              authMethod = ticket
                  .doAs(new PrivilegedExceptionAction<AuthMethod>() {
                    @Override
                    public AuthMethod run()
                        throws IOException, InterruptedException {
                      return setupSaslConnection(ipcStreams);
                    }
                  });
            } catch (IOException ex) {
              if (saslRpcClient == null) {
                // whatever happened -it can't be handled, so rethrow
                throw ex;
              }
              // otherwise, assume a connection problem
              authMethod = saslRpcClient.getAuthMethod();
              if (rand == null) {
                rand = new Random();
              }
              handleSaslConnectionFailure(numRetries++, maxRetriesOnSasl, ex,
                  rand, ticket);
              continue;
            }
{code}
The signature of {{handleSaslConnectionFailure}} accepts an {{Exception}} 
rather than {{IOException}} (even though this is its only usage), so it 
additionally wraps this in a new {{IOException}}:
{code}
    private synchronized void handleSaslConnectionFailure(
        final int currRetries, final int maxRetries, final Exception ex,
        final Random rand, final UserGroupInformation ugi) throws IOException,
        InterruptedException {
          ...
          if (shouldAuthenticateOverKrb()) {
            if (currRetries < maxRetries) {
              ...
              return null;
            } else {
              ...
              throw (IOException) new IOException(msg).initCause(ex);
            }
          } else {
            LOG.warn("Exception encountered while connecting to "
                + "the server : " + ex);
          }
          if (ex instanceof RemoteException)
            throw (RemoteException) ex;
          throw new IOException(ex);
        }
      });
    }
{code}
It looks like the case where it wraps with no additional message would only 
occur in an unkerberized cluster. Are you running without Kerberos? Do you see 
a relevant WARN statement from the log for {{ipc.Client}}? In any case I think 
we should modify {{handleSaslConnectionFailure}} to accept {{IOException}} and 
avoid the additional wrap.


was (Author: xkrogen):
v003 patch LGTM! Thanks for pushing this [~hexiaoqiao] and sorry for the slow 
response.

I wanted to dig deeper to understand why the {{SocketTimeoutException}} was 
over-wrapped in an extra {{IOException}}, which was also a cause of this 
problem. As far as I can tell, it comes from this code block within 
{{Client.Connection#setupIOStreams}}:
{code}
            try {
              authMethod = ticket
                  .doAs(new PrivilegedExceptionAction<AuthMethod>() {
                    @Override
                    public AuthMethod run()
                        throws IOException, InterruptedException {
                      return setupSaslConnection(ipcStreams);
                    }
                  });
            } catch (IOException ex) {
              if (saslRpcClient == null) {
                // whatever happened -it can't be handled, so rethrow
                throw ex;
              }
              // otherwise, assume a connection problem
              authMethod = saslRpcClient.getAuthMethod();
              if (rand == null) {
                rand = new Random();
              }
              handleSaslConnectionFailure(numRetries++, maxRetriesOnSasl, ex,
                  rand, ticket);
              continue;
            }
{code}
The signature of {{handleSaslConnectionFailure}} accepts an {{Exception}} 
rather than {{IOException}} (even though this is its only usage), so it 
additionally wraps this in a new {{IOException}}:
{code}
    private synchronized void handleSaslConnectionFailure(
        final int currRetries, final int maxRetries, final Exception ex,
        final Random rand, final UserGroupInformation ugi) throws IOException,
        InterruptedException {
          ...
          if (shouldAuthenticateOverKrb()) {
            if (currRetries < maxRetries) {
              ...
              return null;
            } else {
              ...
              throw (IOException) new IOException(msg).initCause(ex);
            }
          } else {
            LOG.warn("Exception encountered while connecting to "
                + "the server : " + ex);
          }
          if (ex instanceof RemoteException)
            throw (RemoteException) ex;
          throw new IOException(ex);
        }
      });
    }
{code}
It looks like the case where it wraps with no additional message would only 
occur in an unkerberized cluster. Are you running without Kerberos? In any case 
I think we should modify {{handleSaslConnectionFailure}} to accept 
{{IOException}} and avoid the additional wrap.

> DN may not send block report to NN after NN restart
> ---------------------------------------------------
>
>                 Key: HDFS-12749
>                 URL: https://issues.apache.org/jira/browse/HDFS-12749
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>            Reporter: TanYuxin
>            Priority: Major
>         Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart，DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1474)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>    * Register one bp with the corresponding NameNode
>    * <p>
>    * The bpDatanode needs to register with the namenode on startup in order
>    * 1) to report which storage it is serving now and 
>    * 2) to receive a registrationID
>    *  
>    * issued by the namenode to recognize registered datanodes.
>    * 
>    * @param nsInfo current NamespaceInfo
>    * @see FSNamesystem#registerDatanode(DatanodeRegistration)
>    * @throws IOException
>    */
>   void register(NamespaceInfo nsInfo) throws IOException {
>     // The handshake() phase loaded the block pool storage
>     // off disk - so update the bpRegistration object from that info
>     DatanodeRegistration newBpRegistration = bpos.createRegistration();
>     LOG.info(this + " beginning handshake with NN");
>     while (shouldRun()) {
>       try {
>         // Use returned registration from namenode with updated fields
>         newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
>         newBpRegistration.setNamespaceInfo(nsInfo);
>         bpRegistration = newBpRegistration;
>         break;
>       } catch(EOFException e) {  // namenode might have just restarted
>         LOG.info("Problem connecting to server: " + nnAddr + " :"
>             + e.getLocalizedMessage());
>         sleepAndLogInterrupts(1000, "connecting to server");
>       } catch(SocketTimeoutException e) {  // namenode is busy
>         LOG.info("Problem connecting to server: " + nnAddr);
>         sleepAndLogInterrupts(1000, "connecting to server");
>       }
>     }
>     
>     LOG.info("Block pool " + this + " successfully registered with NN");
>     bpos.registrationSucceeded(this, bpRegistration);
>     // random short delay - helps scatter the BR from all DNs
>     scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But NameNode has processed registerDatanode successfully, so it won't ask DN 
> to re-register again



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (HDFS-12749) DN may not send block report to NN after NN restart

Reply via email to