[ 
https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438531#comment-16438531
 ] 

yunjiong zhao commented on HDFS-13441:
--------------------------------------

[~hexiaoqiao] , this issue is different, it is not about DN register to NN, it 
is about lost heartbeat responses from Standby NameNode could end with some 
DataNodes missed the block key after Standby NameNode become active.

> DataNode missed BlockKey update from NameNode due to HeartbeatResponse was 
> dropped
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-13441
>                 URL: https://issues.apache.org/jira/browse/HDFS-13441
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 2.7.1
>            Reporter: yunjiong zhao
>            Assignee: yunjiong zhao
>            Priority: Major
>         Attachments: HDFS-13441.patch
>
>
> After NameNode failover, lots of application failed due to some DataNodes 
> can't re-compute password from block token.
> {code:java}
> 2018-04-11 20:10:52,448 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error 
> processing unknown operation  src: /10.142.74.116:57404 dst: 
> /10.142.77.45:50010
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.]
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382)
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
>         ... 7 more
> {code}
>  
> In the DataNode log, we didn't see DataNode update block keys around 
> 2018-04-11 09:55:00 and around 2018-04-11 19:55:00.
> {code:java}
> 2018-04-10 14:51:36,424 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-10 23:55:38,420 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 00:51:34,792 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 10:51:39,403 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 20:51:44,422 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-12 02:54:47,855 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-12 05:55:44,456 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> {code}
> The reason is there is SocketTimeOutException when sending heartbeat to 
> StandbyNameNode
> {code:java}
> 2018-04-11 09:55:34,699 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> IOException in offerService
> java.net.SocketTimeoutException: Call From 
> hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com/10.142.77.45 to 
> ares-nn.vip.ebay.com:8030 failed on socket timeout exception: 
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/10.142.77.45:48803 remote=ares-nn.vip.ebay.com/10.103.108.200:8030]; 
> For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
>         at sun.reflect.GeneratedConstructorAccessor32.newInstance(Unknown 
> Source)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1430)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy16.sendHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:155)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:427)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:543)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:725)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-04-11 19:56:13,308 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> IOException in offerService
> java.net.SocketTimeoutException: Call From 
> hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com/10.142.77.45 to 
> ares-nn.vip.ebay.com:8030 failed on socket timeout exception: 
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/10.142.77.45:50381 remote=ares-nn.vip.ebay.com/10.103.108.200:8030]; 
> For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
>         at sun.reflect.GeneratedConstructorAccessor32.newInstance(Unknown 
> Source)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1430)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy16.sendHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:155)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:427)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:543)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:725)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while 
> waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.142.77.45:50381 
> remote=ares-nn.vip.ebay.com/10.103.108.200:8030]
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at 
> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:523)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at 
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1087)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:982)
> {code}
> In the Standby NameNode, we can see it dropped the HeartbeatResponses which 
> contain new BlockKey.
> {code:java}
> 2018-04-11 09:53:34,293 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 25 on 8030: skipped 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.142.77.45:54371 Call#13364496 Retry#0
> 2018-04-11 09:56:42,436 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 8 on 8030: skipped 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.142.77.45:48803 Call#13364569 Retry#0
> 2018-04-11 19:54:13,092 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 43 on 8030: skipped 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.142.77.45:52870 Call#13385487 Retry#0
> 2018-04-11 19:57:21,774 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 50 on 8030: skipped 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.142.77.45:35918 Call#13385562 Retry#0
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to