[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

TanYuxin (JIRA) Mon, 30 Oct 2017 23:26:28 -0700

     [ https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


TanYuxin updated HDFS-12749:
----------------------------
    Description: 
Our cluster currently has 7000+ DNs, 180+ million files, and 180+ million blocks. 
When the NN restarts, its load is very high.
After the SNN restarts, each DN calls the BPServiceActor#reRegister method to register. 
However, the register RPC fails with an IOException because the NN is busy processing 
block reports. The exception is caught in BPServiceActor#processCommand.
The caught IOException is shown below:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host Details : local host is: "datanode-2220/10.14.110.33"; destination host is: "namenode.host.03":8040;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
        at org.apache.hadoop.ipc.Client.call(Client.java:1474)
        at org.apache.hadoop.ipc.Client.call(Client.java:1407)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
        at java.lang.Thread.run(Thread.java:745)
{code}

If an IOException is thrown in BPServiceActor#register, the 
scheduler.scheduleBlockReport method is never run, so the block report is 
not sent immediately. 
However, the NN has already received the register RPC and successfully 
registered the DN, so it will not ask the DN to register again at the next 
heartbeat. As a result, the block report is not sent correctly after registration. 
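
The failure mode described above can be illustrated with a minimal, self-contained Java sketch. It is not the Hadoop source: NameNodeStub, Scheduler, reRegisterAsReported, reRegisterWithFinally, and scheduledAfterTimeout are hypothetical names used only for illustration, and the stub simply throws a timeout to mimic a busy NN. The second variant, which schedules the block report in a finally block, is just one possible mitigation under these assumptions and is not necessarily the fix adopted for this issue.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

/** Toy model of the DN re-registration path. All names here are hypothetical. */
class ReRegisterSketch {

    /** Stand-in for the register RPC; the real NN may apply it even if the reply times out. */
    interface NameNodeStub {
        void registerDatanode() throws IOException;
    }

    /** Stand-in for the block report scheduler; it only records that a report was requested. */
    static class Scheduler {
        boolean blockReportScheduled = false;

        void scheduleBlockReport() {
            blockReportScheduled = true;
        }
    }

    /** Behaviour described in the report: an exception from register skips the scheduling call. */
    static void reRegisterAsReported(NameNodeStub nn, Scheduler scheduler) throws IOException {
        nn.registerDatanode();
        scheduler.scheduleBlockReport(); // never reached when register throws
    }

    /** Hypothetical mitigation: schedule the block report whether or not the RPC reply arrived. */
    static void reRegisterWithFinally(NameNodeStub nn, Scheduler scheduler) throws IOException {
        try {
            nn.registerDatanode();
        } finally {
            scheduler.scheduleBlockReport();
        }
    }

    /** Simulates a busy NN whose reply times out, then reports whether a block report was scheduled. */
    static boolean scheduledAfterTimeout(boolean useFinally) {
        NameNodeStub busyNameNode = () -> {
            throw new SocketTimeoutException("60000 millis timeout (simulated)");
        };
        Scheduler scheduler = new Scheduler();
        try {
            if (useFinally) {
                reRegisterWithFinally(busyNameNode, scheduler);
            } else {
                reRegisterAsReported(busyNameNode, scheduler);
            }
        } catch (IOException e) {
            // Swallowed here, mirroring how the caller only logs a warning in the report above.
        }
        return scheduler.blockReportScheduled;
    }
}
```

In this sketch, scheduledAfterTimeout(false) returns false, matching the reported symptom, while scheduledAfterTimeout(true) returns true. Whether scheduling a block report after a failed register call is safe in the real DataNode depends on details not covered here.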

  was:
Now our cluster have 7000+ DN, files num 180+ million, block num 180+ million. 
When NN restart, NN's load is very high.
After SNN restart，DN will call BPServiceActor#reRegister method to register. 
But register RPC will get a IOException since NN is busy dealing with Block 
Report.  The exception is caught at BPServiceActor#processCommand.
Next is the caught IOException:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host 
Details : local host is: "datanode-2220/10.14.110.33"; destination host is: 
"namenode.host.03":8040;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
        at org.apache.hadoop.ipc.Client.call(Client.java:1474)
        at org.apache.hadoop.ipc.Client.call(Client.java:1407)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
        at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
        at java.lang.Thread.run(Thread.java:745)
{code}

If encountering a IOException in BPServiceActor#register,  
scheduler.scheduleBlockReport method can't be run, and the Block Report will 
not be sent immediately. 
But NN has get the register RPC, and successfully register the DN. So NN will 
not make DN register again, which makes Block Report  is not sent


> DN may not send block report to NN after NN restart
> ---------------------------------------------------
>
>                 Key: HDFS-12749
>                 URL: https://issues.apache.org/jira/browse/HDFS-12749
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: TanYuxin
>
> Our cluster currently has 7000+ DNs, 180+ million files, and 180+ million 
> blocks. When the NN restarts, its load is very high.
> After the SNN restarts, each DN calls the BPServiceActor#reRegister method to register. 
> However, the register RPC fails with an IOException because the NN is busy processing 
> block reports. The exception is caught in BPServiceActor#processCommand.
> The caught IOException is shown below:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host Details : local host is: "datanode-2220/10.14.110.33"; destination host is: "namenode.host.03":8040;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1474)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
>         at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> If an IOException is thrown in BPServiceActor#register, the 
> scheduler.scheduleBlockReport method is never run, so the block report is 
> not sent immediately. 
> However, the NN has already received the register RPC and successfully 
> registered the DN, so it will not ask the DN to register again at the next 
> heartbeat. As a result, the block report is not sent correctly after registration. 


