[
https://issues.apache.org/jira/browse/HDFS-13833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584285#comment-16584285
]
Henrique Barros commented on HDFS-13833:
----------------------------------------
*Context:* We are migrating from HortonWorks Hadoop v2.3 to this one. The POC
is crucial because we decommissioned the HortonWorks nodes in order to install
this Cloudera POC. With this random error we cannot accept the solution, at
least not without knowing the real cause.
We already tried turning dfs.namenode.replication.considerLoad off, and the
error goes away, but that only hides the problem.
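To make the considerLoad behaviour concrete, here is a minimal sketch (not the actual Hadoop source; the class and method names below are made up for illustration) of the check in BlockPlacementPolicyDefault. The threshold is roughly 2x the average in-service xceiver load, so if the cluster stats report a total load of zero (for example because the only DataNode has dropped out of the in-service set), the threshold becomes 0.0 and any node with nonzero load is rejected with a message like "load: 8 > 0.0":

```java
// Simplified model of the considerLoad check. In real Hadoop the average
// comes from FSClusterStats and the factor 2.0 is the default
// considerLoad factor; everything here is a stand-in for illustration.
public class ConsiderLoadSketch {

    // Average xceiver load across in-service DataNodes.
    static double averageLoad(int totalXceivers, int inServiceNodes) {
        return inServiceNodes == 0 ? 0.0 : (double) totalXceivers / inServiceNodes;
    }

    // A node is "too busy" when its load exceeds 2x the cluster average.
    static boolean isTooBusy(int nodeLoad, double avgLoad) {
        double maxLoad = 2.0 * avgLoad;
        return nodeLoad > maxLoad;
    }

    public static void main(String[] args) {
        // Healthy case: the single DN's 8 xceivers are counted in the stats,
        // so the threshold is 16.0 and the node is accepted.
        System.out.println(isTooBusy(8, averageLoad(8, 1))); // false

        // Failure case: the DN's load is missing from the stats, the average
        // is 0.0, and the node fails with the equivalent of "load: 8 > 0.0".
        System.out.println(isTooBusy(8, averageLoad(0, 1))); // true
    }
}
```

This would explain why the right-hand number in the log message is always zero regardless of the node's own load.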
It is not because of the load. Our load is really low across the whole
cluster (2 NameNodes and one DataNode). Disks, CPU, and memory are all idle;
we have no network or disk issues, and we get around 1 Gbit/s between all
three machines.
It seems to me that the node is being excluded for some reason we cannot find
in the logs; the total load then becomes 0 and the message
{code:java}
load: 8 > 0.0{code}
shows up. Sometimes that load is 2, other times 10, but the total load (the
number on the right) is always zero, which looks like a consequence of the
only DN being excluded.
Do you know of other relevant classes I can enable DEBUG logging on, in order
to find out more about this?
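For reference, a log4j.properties fragment (assuming the standard Hadoop NameNode log4j setup) that enables DEBUG for the loggers already visible in the pasted output; the blockmanagement package also covers BlockManager, DatanodeManager, and HeartbeatManager, which maintain the in-service DataNode stats and so may show why the node is excluded:

{code:java}
# Loggers seen in the output above, plus the rest of the
# blockmanagement package (BlockManager, DatanodeManager, ...)
log4j.logger.org.apache.hadoop.hdfs.server.blockmanagement=DEBUG
log4j.logger.org.apache.hadoop.net.NetworkTopology=DEBUG
log4j.logger.org.apache.hadoop.hdfs.StateChange=DEBUG
{code}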
Any help is appreciated; we have already tried many configurations, including
raising the Cloudera CDH version (it is now the one in the description box)
and even upgrading our Flink version from 1.3.2 to 1.6.0, and the same thing
happens. Flink is our client, and this exception only occurs with the Flink
checkpoints pointing to HDFS.
Best Regards,
Barros
> Failed to choose from local rack (location = /default); the second replica is
> not found, retry choosing ramdomly
> ----------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-13833
> URL: https://issues.apache.org/jira/browse/HDFS-13833
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Henrique Barros
> Priority: Critical
>
> I'm having a random problem with block replication on Hadoop
> 2.6.0-cdh5.15.0
> with Cloudera CDH-5.15.0-1.cdh5.15.0.p0.21
>
> In my case we are getting this error very randomly (after some hours), with
> only one Datanode (for now; we are trying this Cloudera cluster for a POC).
> Here is the Log.
> {code:java}
> Choosing random from 1 available nodes on node /default, scope=/default,
> excludedScope=null, excludeNodes=[]
> 2:38:20.527 PM DEBUG NetworkTopology
> Choosing random from 0 available nodes on node /default, scope=/default,
> excludedScope=null, excludeNodes=[192.168.220.53:50010]
> 2:38:20.527 PM DEBUG NetworkTopology
> chooseRandom returning null
> 2:38:20.527 PM DEBUG BlockPlacementPolicy
> [
> Node /default/192.168.220.53:50010 [
> Datanode 192.168.220.53:50010 is not chosen since the node is too busy
> (load: 8 > 0.0).
> 2:38:20.527 PM DEBUG NetworkTopology
> chooseRandom returning 192.168.220.53:50010
> 2:38:20.527 PM INFO BlockPlacementPolicy
> Not enough replicas was chosen. Reason:{NODE_TOO_BUSY=1}
> 2:38:20.527 PM DEBUG StateChange
> closeFile:
> /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/eef8bff6-75a9-43c1-ae93-4b1a9ca31ad9
> with 1 blocks is persisted to the file system
> 2:38:20.527 PM DEBUG StateChange
> *BLOCK* NameNode.addBlock: file
> /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/1cfe900d-6f45-4b55-baaa-73c02ace2660
> fileId=129628869 for DFSClient_NONMAPREDUCE_467616914_65
> 2:38:20.527 PM DEBUG BlockPlacementPolicy
> Failed to choose from local rack (location = /default); the second replica is
> not found, retry choosing ramdomly
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
>
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:784)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:694)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:601)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:561)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:464)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:395)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:270)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:142)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:158)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1715)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3505)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:694)
> at
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:219)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:507)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
> {code}
> This part makes no sense at all:
> {code:java}
> load: 8 > 0.0{code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)