[ https://issues.apache.org/jira/browse/HDFS-13833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585717#comment-16585717 ]

Henrique Barros edited comment on HDFS-13833 at 8/20/18 9:54 AM:
-----------------------------------------------------------------

Yes, [~hexiaoqiao] and [~xiaochen], it appears the logic you explained is exactly what is happening. The only thing that conflicts with it is that the load message is out of order with the message that reports the excluded node: it appears to exclude the node first and only then print the load message.

However, you could reproduce it, and your analysis makes total sense: it can only be inconsistent stats, with chooseTarget and {{sendHeartbeat}} being invoked at the same time. But if that is the case, I think the stats should be kept consistent somehow until the next heartbeat.
Your explanation would still account for why it happens so randomly.
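To make that race concrete, here is a minimal, self-contained sketch of the load check as I understand it from the log (the class and method names below are mine, purely for illustration, not the real BlockPlacementPolicyDefault code). As far as I can tell, a node is rejected when its xceiver count exceeds roughly twice the in-service average reported through heartbeats, so if chooseTarget reads the stats while they are momentarily inconsistent, the average can be 0.0 and the only DataNode gets excluded with exactly the 'load: 8 > 0.0' message:
{code:java}
// Illustrative sketch only; the names here are hypothetical, not real HDFS classes.
public class ConsiderLoadSketch {

    // Rough shape of the considerLoad check: reject a node whose xceiver
    // count is above a multiple of the cluster's in-service average.
    static boolean isTooBusy(int nodeXceiverCount, double inServiceXceiverAverage) {
        final double maxLoad = 2.0 * inServiceXceiverAverage;
        return nodeXceiverCount > maxLoad;
    }

    public static void main(String[] args) {
        // Normal case: the average already includes this node's own load,
        // so maxLoad = 16.0 and the node is accepted.
        System.out.println(isTooBusy(8, 8.0));   // false

        // Race case: chooseTarget sees a stale/zero average while the node
        // already reports 8 xceivers, reproducing "load: 8 > 0.0".
        System.out.println(isTooBusy(8, 0.0));   // true, node excluded
    }
}
{code}
With a single DataNode that race is fatal for placement, because once the only node is excluded there is nothing left to choose.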

I will keep 'considerLoad' disabled for now and will check whether it still happens with more than one DataNode.
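For clarity, the flag I am talking about is, as far as I know, {{dfs.namenode.replication.considerLoad}} (default {{true}}); it has to be set to {{false}} on the NameNode side (we do it through the hdfs-site.xml safety valve in Cloudera Manager). A tiny, purely illustrative sketch of how I am double-checking the effective value from a machine that has the cluster configs on the classpath:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class CheckConsiderLoad {
    public static void main(String[] args) {
        // HdfsConfiguration picks up hdfs-default.xml and hdfs-site.xml
        // from the classpath, so this prints the effective value.
        Configuration conf = new HdfsConfiguration();
        System.out.println("dfs.namenode.replication.considerLoad = "
            + conf.getBoolean("dfs.namenode.replication.considerLoad", true));
    }
}
{code}
Nothing official, just how I am verifying that the override actually landed before waiting for the issue to reappear.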

 

Thank you very much for your fast response and help.
I will come back with some conclusions on this issue soon, so we can see whether to close it or recheck it.

 


was (Author: rikeppb100):
Yes, [~hexiaoqiao] and [~xiaochen], it appears the logic you explained is exactly what is happening. The only thing that conflicts with it is that the load message is out of order with the message that reports the excluded node: it appears to exclude the node first and only then print the load message.


However, you could reproduce it, and your analysis makes total sense: it can only be inconsistent stats, with chooseTarget and {{sendHeartbeat}} being invoked at the same time. But if that is the case, I think the stats should be kept consistent somehow until the next heartbeat.

I will keep 'considerLoad' disabled for now and will check whether it still happens with more than one DataNode.

 

Thank you very much for your fast response and help.
I will come back with some conclusions on this issue soon, so we can see whether to close it or recheck it.

 

> Failed to choose from local rack (location = /default); the second replica is 
> not found, retry choosing ramdomly
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13833
>                 URL: https://issues.apache.org/jira/browse/HDFS-13833
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Henrique Barros
>            Priority: Critical
>
> I'm having a random problem with block replication on Hadoop 2.6.0-cdh5.15.0 
> with Cloudera CDH-5.15.0-1.cdh5.15.0.p0.21.
>  
> In my case we are getting this error very randomly (after some hours) and 
> with only one DataNode (for now, we are trying this Cloudera cluster for a 
> POC).
> Here is the log:
> {code:java}
> Choosing random from 1 available nodes on node /default, scope=/default, 
> excludedScope=null, excludeNodes=[]
> 2:38:20.527 PM        DEBUG   NetworkTopology 
> Choosing random from 0 available nodes on node /default, scope=/default, 
> excludedScope=null, excludeNodes=[192.168.220.53:50010]
> 2:38:20.527 PM        DEBUG   NetworkTopology 
> chooseRandom returning null
> 2:38:20.527 PM        DEBUG   BlockPlacementPolicy    
> [
> Node /default/192.168.220.53:50010 [
>   Datanode 192.168.220.53:50010 is not chosen since the node is too busy 
> (load: 8 > 0.0).
> 2:38:20.527 PM        DEBUG   NetworkTopology 
> chooseRandom returning 192.168.220.53:50010
> 2:38:20.527 PM        INFO    BlockPlacementPolicy    
> Not enough replicas was chosen. Reason:{NODE_TOO_BUSY=1}
> 2:38:20.527 PM        DEBUG   StateChange     
> closeFile: 
> /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/eef8bff6-75a9-43c1-ae93-4b1a9ca31ad9
>  with 1 blocks is persisted to the file system
> 2:38:20.527 PM        DEBUG   StateChange     
> *BLOCK* NameNode.addBlock: file 
> /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/1cfe900d-6f45-4b55-baaa-73c02ace2660
>  fileId=129628869 for DFSClient_NONMAPREDUCE_467616914_65
> 2:38:20.527 PM        DEBUG   BlockPlacementPolicy    
> Failed to choose from local rack (location = /default); the second replica is 
> not found, retry choosing ramdomly
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
>  
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:784)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:694)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:601)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:561)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:464)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:395)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:270)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:142)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:158)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1715)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3505)
>       at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:694)
>       at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:219)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:507)
>       at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
> {code}
> This part makes no sense at all:
> {code:java}
> load: 8 > 0.0{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
