[jira] [Commented] (HDFS-13833) Failed to choose from local rack (location = /default); the second replica is not found, retry choosing ramdomly

Henrique Barros (JIRA) Fri, 17 Aug 2018 12:03:27 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-13833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584285#comment-16584285
 ]


Henrique Barros commented on HDFS-13833:
----------------------------------------

*Context:* We are migrating from Hortonworks Hadoop v2.3 to this one. The POC 
is crucial, since we decommissioned the HW nodes to install this Cloudera 
POC. With this random error we cannot accept the solution, at least not 
without knowing the real cause.

We already tried turning off dfs.namenode.replication.considerLoad, and the 
error goes away, but that only hides the problem.


It is not because of the load. Our load is very low across the whole 
cluster (2 NNs and one DN).
Disks, CPU, and memory are all idle, and we have no network or disk 
issues; we get around 1 Gbit/s between all 3 machines.

It seems to me that the node is being excluded for some reason that we cannot 
find in the logs, so the total load becomes 0 and this message shows up:
{code:java}
load: 8 > 0.0{code}
Sometimes that load is 2, other times it is 10, but the total load (the number 
on the right) is always zero, which looks like a consequence of the only DN 
being excluded.
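
For reference, a minimal sketch of the comparison that seems to produce this 
message. It assumes the default placement policy rejects a DataNode when its 
xceiver count exceeds 2.0 times the average over in-service DataNodes, and 
that the average is reported as 0.0 when no node is counted as in service. 
The class and method names are hypothetical, not Hadoop's API.

```java
public class LoadCheckSketch {
    // Assumed multiplier applied to the cluster-wide average load.
    static final double LOAD_FACTOR = 2.0;

    /** Average xceiver count over in-service DataNodes; 0.0 when none are counted. */
    public static double inServiceAverage(int totalInServiceXceivers, int nodesInService) {
        return nodesInService == 0 ? 0.0 : (double) totalInServiceXceivers / nodesInService;
    }

    /** True when the node's load does not exceed the threshold. */
    public static boolean passesLoadCheck(int nodeLoad, int totalInServiceXceivers, int nodesInService) {
        double maxLoad = LOAD_FACTOR * inServiceAverage(totalInServiceXceivers, nodesInService);
        return nodeLoad <= maxLoad;
    }

    public static void main(String[] args) {
        // Single DataNode whose 8 xceivers are counted: threshold is 16.0, node passes.
        System.out.println(passesLoadCheck(8, 8, 1));
        // Aggregate reads as zero: threshold is 0.0, so any active node is rejected.
        System.out.println(passesLoadCheck(8, 0, 0));
    }
}
```

Under these assumptions, a right-hand side of 0.0 means the aggregate itself is 
zero, so the actual load on the DataNode is irrelevant to the rejection.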

Do you know of any other crucial classes I can enable DEBUG logging on, in 
order to find out more about this?
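
In the meantime, the messages already present in the NameNode log can be 
tallied to confirm that the right-hand number is always zero. The following is 
only a sketch: the class name is hypothetical and the regular expression 
assumes the message format quoted in this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TooBusyScan {
    // Matches the "load: 8 > 0.0" fragment of the "node is too busy" message.
    static final Pattern LOAD =
        Pattern.compile("load:\\s*(\\d+)\\s*>\\s*(\\d+(?:\\.\\d+)?)");

    /** Returns {nodeLoad, threshold} pairs for every match found in the text. */
    public static List<double[]> extract(String logText) {
        List<double[]> hits = new ArrayList<>();
        Matcher m = LOAD.matcher(logText);
        while (m.find()) {
            hits.add(new double[] {
                Double.parseDouble(m.group(1)),
                Double.parseDouble(m.group(2))
            });
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "Datanode 192.168.220.53:50010 is not chosen since "
            + "the node is too busy (load: 8 > 0.0).";
        for (double[] hit : extract(sample)) {
            System.out.println("node load=" + hit[0] + " threshold=" + hit[1]);
        }
    }
}
```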

Any help is appreciated. We have already tried many configurations, including 
upgrading the Cloudera CDH version (it is now the one in the description box) 
and even upgrading Flink from 1.3.2 to 1.6.0, and the same thing happens.

Flink is our client, and this exception only happens with the Flink Checkpoints 
pointing to HDFS.

 

Best Regards,

Barros

> Failed to choose from local rack (location = /default); the second replica is 
> not found, retry choosing ramdomly
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13833
>                 URL: https://issues.apache.org/jira/browse/HDFS-13833
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Henrique Barros
>            Priority: Critical
>
> I'm having a random problem with block replication with Hadoop 
> 2.6.0-cdh5.15.0
> with Cloudera CDH-5.15.0-1.cdh5.15.0.p0.21.
>  
> In my case we get this error at random (after some hours), and 
> with only one DataNode (for now; we are trying this Cloudera cluster for a 
> POC).
> Here is the log:
> {code:java}
> Choosing random from 1 available nodes on node /default, scope=/default, 
> excludedScope=null, excludeNodes=[]
> 2:38:20.527 PM        DEBUG   NetworkTopology 
> Choosing random from 0 available nodes on node /default, scope=/default, 
> excludedScope=null, excludeNodes=[192.168.220.53:50010]
> 2:38:20.527 PM        DEBUG   NetworkTopology 
> chooseRandom returning null
> 2:38:20.527 PM        DEBUG   BlockPlacementPolicy    
> [
> Node /default/192.168.220.53:50010 [
>   Datanode 192.168.220.53:50010 is not chosen since the node is too busy 
> (load: 8 > 0.0).
> 2:38:20.527 PM        DEBUG   NetworkTopology 
> chooseRandom returning 192.168.220.53:50010
> 2:38:20.527 PM        INFO    BlockPlacementPolicy    
> Not enough replicas was chosen. Reason:{NODE_TOO_BUSY=1}
> 2:38:20.527 PM        DEBUG   StateChange     
> closeFile: 
> /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/eef8bff6-75a9-43c1-ae93-4b1a9ca31ad9
>  with 1 blocks is persisted to the file system
> 2:38:20.527 PM        DEBUG   StateChange     
> *BLOCK* NameNode.addBlock: file 
> /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/1cfe900d-6f45-4b55-baaa-73c02ace2660
>  fileId=129628869 for DFSClient_NONMAPREDUCE_467616914_65
> 2:38:20.527 PM        DEBUG   BlockPlacementPolicy    
> Failed to choose from local rack (location = /default); the second replica is 
> not found, retry choosing ramdomly
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:784)
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:694)
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:601)
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:561)
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:464)
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:395)
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:270)
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:142)
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:158)
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1715)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3505)
>       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:694)
>       at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:219)
>       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:507)
>       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
> {code}
> This part makes no sense at all:
> {code:java}
> load: 8 > 0.0{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

