[jira] [Comment Edited] (HDFS-13833) Failed to choose from local rack (location = /default); the second replica is not found, retry choosing ramdomly

2018-08-20 Thread Henrique Barros (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585717#comment-16585717
 ] 

Henrique Barros edited comment on HDFS-13833 at 8/20/18 9:54 AM:
-

Yes, [~hexiaoqiao] and [~xiaochen], it appears the logic you explained is 
exactly what is happening. The only thing that conflicts with it is that the 
load message is out of order with the message that reports the excluded node: 
it appears the node is excluded first and the load message is printed afterwards.

However, you were able to reproduce it and your analysis makes total sense; it 
could only be inconsistent stats, with {{chooseTarget}} and {{sendHeartbeat}} 
being invoked at the same time. But if that is the case, the stats should be 
saved somehow until the next heartbeat, I think.
Your explanation would also account for why it happens so randomly.

I will keep 'considerLoad' deactivated for now and will check whether it still 
happens with more than one DataNode.
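
For reference, my reading of that check, as a rough standalone sketch (the 
2 × average rule is assumed from the message format; the class and method names 
below are illustrative, not the actual BlockPlacementPolicyDefault code):
{code:java}
// Illustrative sketch only: approximates the considerLoad rejection that
// produces "node is too busy (load: X > Y)"; not the actual Hadoop source.
public class ConsiderLoadSketch {

    /** Snapshot of cluster stats as the placement policy would see them. */
    static class StatsSnapshot {
        int inServiceXceiverSum;   // total active transfer threads reported
        int inServiceNodeCount;    // DataNodes counted as in service

        double inServiceXceiverAverage() {
            return inServiceNodeCount == 0
                ? 0.0
                : (double) inServiceXceiverSum / inServiceNodeCount;
        }
    }

    /** Rejects a node whose load exceeds 2x the in-service average. */
    static boolean isGoodTargetByLoad(int nodeXceiverCount, StatsSnapshot stats) {
        double maxLoad = 2.0 * stats.inServiceXceiverAverage();
        if (nodeXceiverCount > maxLoad) {
            System.out.println("not chosen since the node is too busy (load: "
                + nodeXceiverCount + " > " + maxLoad + ")");
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // If a heartbeat update races with chooseTarget, the snapshot can
        // briefly report zero in-service xceivers (or zero nodes), so the
        // threshold becomes 0.0 and the single DataNode with load 8 is rejected.
        StatsSnapshot staleSnapshot = new StatsSnapshot();
        staleSnapshot.inServiceXceiverSum = 0;
        staleSnapshot.inServiceNodeCount = 0;
        isGoodTargetByLoad(8, staleSnapshot);  // prints "load: 8 > 0.0"
    }
}
{code}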

 

Thank you very much for your quick response and help.
 I will come back with some conclusions on this issue soon, so we can decide 
whether to close it or look into it again.

 



> Failed to choose from local rack (location = /default); the second replica is 
> not found, retry choosing ramdomly
> 
>
> Key: HDFS-13833
> URL: https://issues.apache.org/jira/browse/HDFS-13833
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Henrique Barros
>Priority: Critical
>
> I'm having a random problem with block replication with Hadoop 
> 2.6.0-cdh5.15.0,
> with Cloudera CDH-5.15.0-1.cdh5.15.0.p0.21.
>  
> In my case we are getting this error very randomly (after some hours) and 
> with only one DataNode (for now; we are trying this Cloudera cluster for a 
> POC).
> Here is the log.
> {code:java}
> Choosing random from 1 available nodes on node /default, scope=/default, 
> excludedScope=null, excludeNodes=[]
> 2:38:20.527 PM  DEBUG   NetworkTopology 
> Choosing random from 0 available nodes on node /default, scope=/default, 
> excludedScope=null, excludeNodes=[192.168.220.53:50010]
> 2:38:20.527 PM  DEBUG   NetworkTopology 
> chooseRandom returning null
> 2:38:20.527 PM  DEBUG   BlockPlacementPolicy
> [
> Node /default/192.168.220.53:50010 [
>   Datanode 192.168.220.53:50010 is not chosen since the node is too busy 
> (load: 8 > 0.0).
> 2:38:20.527 PM  DEBUG   NetworkTopology 
> chooseRandom returning 192.168.220.53:50010
> 2:38:20.527 PM  INFO    BlockPlacementPolicy
> Not enough replicas was chosen. Reason:{NODE_TOO_BUSY=1}
> 2:38:20.527 PM  DEBUG   StateChange 
> closeFile: 
> /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/eef8bff6-75a9-43c1-ae93-4b1a9ca31ad9
>  with 1 blocks is persisted to the file system
> 2:38:20.527 PM  DEBUG   StateChange 
> *BLOCK* NameNode.addBlock: file 
> /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/1cfe900d-6f45-4b55-baaa-73c02ace2660
>  fileId=129628869 for DFSClient_NONMAPREDUCE_467616914_65
> 2:38:20.527 PM  DEBUG   BlockPlacementPolicy
> Failed to choose from local rack (location = /default); the second replica is 
> not found, retry choosing ramdomly
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
>  
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:784)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:694)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:601)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:561)
>   at 
> 


[jira] [Commented] (HDFS-13833) Failed to choose from local rack (location = /default); the second replica is not found, retry choosing ramdomly

2018-08-17 Thread Henrique Barros (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584285#comment-16584285
 ] 

Henrique Barros commented on HDFS-13833:


*Context:* We are migrating from HortonWorks Hadoop v2.3 to this one. The POC 
is crucial, since we decommissioned the HW nodes to install this Cloudera POC. 
With this random error we cannot accept the solution, at least not without 
knowing the real cause.

We already tried turning that off (dfs.namenode.replication.considerLoad) and 
it works, but that only hides the problem.


It is not because of the load. Our load is really, really low across the whole 
cluster (2 NNs and one DN).
Disks, CPU and memory are all idle, and we have no network or disk issues; we 
get around 1 Gbit/s between all 3 machines.

It seems to me that the node is being excluded for some reason we cannot find 
in the logs; the total load then becomes equal to 0 and the message:
{code:java}
load: 8 > 0.0{code}
shows up. Sometimes that load is 2, other times it is 10, but the total load 
(the number on the right) is always zero, which looks like a consequence of the 
only DN being excluded.

Do you know any other crucial classes I can enable DEBUG logging on, in order 
to find out more about this?
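
For reference, this is the sort of log4j.properties change I would expect to 
need (the logger names are assumed from the stock Hadoop 2.x classes whose 
short names appear in the log above, and should be verified against the CDH 
build):
{code}
# Hedged example: enable DEBUG for the loggers seen in the trace above
# (logger names assumed from stock Hadoop 2.x; verify against the CDH build).
log4j.logger.org.apache.hadoop.hdfs.StateChange=DEBUG
log4j.logger.org.apache.hadoop.net.NetworkTopology=DEBUG
log4j.logger.org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy=DEBUG
# Possibly useful as well (assumptions, not confirmed in this thread):
log4j.logger.org.apache.hadoop.hdfs.server.blockmanagement.BlockManager=DEBUG
log4j.logger.org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager=DEBUG
{code}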

Any help is appreciated; we have already tried so many configurations, 
including raising the Cloudera CDH version (it is now the one in the 
description box) and even raising our Flink version from 1.3.2 to 1.6.0, and 
the same happens.

Flink is our client, and this exception only happens with the Flink Checkpoints 
pointing to HDFS.

 

Best Regards,

Barros


[jira] [Commented] (HDFS-5970) callers of NetworkTopology's chooseRandom method to expect null return value

2018-08-17 Thread Henrique Barros (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584198#comment-16584198
 ] 

Henrique Barros commented on HDFS-5970:
---

I just reproduced it returning null.
Please see the issue I created:

https://issues.apache.org/jira/browse/HDFS-13833
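
For illustration, the kind of caller-side guard this issue asks for, as a rough 
standalone sketch (the chooseRandom below is only a stand-in for 
NetworkTopology#chooseRandom, and the fallback handling is an assumption, not 
the real BlockPlacementPolicyDefault code):
{code:java}
import java.util.List;
import java.util.Random;

// Standalone sketch: shows the kind of null check callers such as
// BlockPlacementPolicyDefault would need around a chooseRandom-style call.
public class ChooseRandomCallerSketch {
    private static final Random RANDOM = new Random();

    /** Returns a random node from the scope, or null when none is available. */
    static String chooseRandom(List<String> nodesInScope) {
        if (nodesInScope.isEmpty()) {
            return null;  // this is the case callers must expect
        }
        return nodesInScope.get(RANDOM.nextInt(nodesInScope.size()));
    }

    static String chooseTarget(List<String> nodesInScope) {
        String chosen = chooseRandom(nodesInScope);
        if (chosen == null) {
            // Handle the empty result instead of dereferencing it, e.g. report
            // "not enough replicas" and let the caller retry or fail cleanly.
            System.out.println("chooseRandom returned null: no node available");
        }
        return chosen;
    }

    public static void main(String[] args) {
        chooseTarget(List.of());                    // prints the null-handling message
        System.out.println(chooseTarget(List.of("192.168.220.53:50010")));
    }
}
{code}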

> callers of NetworkTopology's chooseRandom method to expect null return value
> 
>
> Key: HDFS-5970
> URL: https://issues.apache.org/jira/browse/HDFS-5970
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.0.0-alpha1
>Reporter: Yongjun Zhang
>Priority: Minor
>
> Class NetworkTopology's method
>public Node chooseRandom(String scope) 
> calls 
>private Node chooseRandom(String scope, String excludedScope)
> which may return a null value.
> Callers of this method, such as BlockPlacementPolicyDefault, need to be 
> aware of that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-10453) ReplicationMonitor thread could stuck for long time due to the race between replication and delete of same file in a large cluster.

2018-08-17 Thread Henrique Barros (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584189#comment-16584189
 ] 

Henrique Barros edited comment on HDFS-10453 at 8/17/18 5:26 PM:
-

I have the same problem with Hadoop 2.6.0-cdh5.15.0,
with Cloudera CDH-5.15.0-1.cdh5.15.0.p0.21.

 

In my case we are getting this error very randomly and with only one DataNode 
(for now).
Here is the log.
{code:java}
Choosing random from 1 available nodes on node /default, scope=/default, 
excludedScope=null, excludeNodes=[]
2:38:20.527 PM  DEBUG   NetworkTopology 
Choosing random from 0 available nodes on node /default, scope=/default, 
excludedScope=null, excludeNodes=[192.168.220.53:50010]
2:38:20.527 PM  DEBUG   NetworkTopology 
chooseRandom returning null
2:38:20.527 PM  DEBUG   BlockPlacementPolicy
[
Node /default/192.168.220.53:50010 [
  Datanode 192.168.220.53:50010 is not chosen since the node is too busy (load: 
8 > 0.0).
2:38:20.527 PM  DEBUG   NetworkTopology 
chooseRandom returning 192.168.220.53:50010
2:38:20.527 PM  INFO    BlockPlacementPolicy
Not enough replicas was chosen. Reason:{NODE_TOO_BUSY=1}
2:38:20.527 PM  DEBUG   StateChange 
closeFile: 
/mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/eef8bff6-75a9-43c1-ae93-4b1a9ca31ad9
 with 1 blocks is persisted to the file system
2:38:20.527 PM  DEBUG   StateChange 
*BLOCK* NameNode.addBlock: file 
/mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/1cfe900d-6f45-4b55-baaa-73c02ace2660
 fileId=129628869 for DFSClient_NONMAPREDUCE_467616914_65
2:38:20.527 PM  DEBUG   BlockPlacementPolicy
Failed to choose from local rack (location = /default); the second replica is 
not found, retry choosing ramdomly
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
 
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:784)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:694)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:601)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:561)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:464)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:395)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:270)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:142)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:158)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1715)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3505)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:694)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:219)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:507)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)

{code}
This part makes no sense at all:
{code:java}
load: 8 > 0.0{code}
I created a dedicated bug for this case, since it may have nothing to do with 
this one:
https://issues.apache.org/jira/browse/HDFS-13833

 


[jira] [Created] (HDFS-13833) Failed to choose from local rack (location = /default); the second replica is not found, retry choosing ramdomly

2018-08-17 Thread Henrique Barros (JIRA)
Henrique Barros created HDFS-13833:
--

 Summary: Failed to choose from local rack (location = /default); 
the second replica is not found, retry choosing ramdomly
 Key: HDFS-13833
 URL: https://issues.apache.org/jira/browse/HDFS-13833
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Henrique Barros


I'm having a random problem with block replication with Hadoop 2.6.0-cdh5.15.0,
with Cloudera CDH-5.15.0-1.cdh5.15.0.p0.21.

 

In my case we are getting this error very randomly (after some hours) and with 
only one DataNode (for now; we are trying this Cloudera cluster for a POC).
Here is the log.
{code:java}
Choosing random from 1 available nodes on node /default, scope=/default, 
excludedScope=null, excludeNodes=[]
2:38:20.527 PM  DEBUG   NetworkTopology 
Choosing random from 0 available nodes on node /default, scope=/default, 
excludedScope=null, excludeNodes=[192.168.220.53:50010]
2:38:20.527 PM  DEBUG   NetworkTopology 
chooseRandom returning null
2:38:20.527 PM  DEBUG   BlockPlacementPolicy
[
Node /default/192.168.220.53:50010 [
  Datanode 192.168.220.53:50010 is not chosen since the node is too busy (load: 
8 > 0.0).
2:38:20.527 PM  DEBUG   NetworkTopology 
chooseRandom returning 192.168.220.53:50010
2:38:20.527 PM  INFO    BlockPlacementPolicy
Not enough replicas was chosen. Reason:{NODE_TOO_BUSY=1}
2:38:20.527 PM  DEBUG   StateChange 
closeFile: 
/mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/eef8bff6-75a9-43c1-ae93-4b1a9ca31ad9
 with 1 blocks is persisted to the file system
2:38:20.527 PM  DEBUG   StateChange 
*BLOCK* NameNode.addBlock: file 
/mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/1cfe900d-6f45-4b55-baaa-73c02ace2660
 fileId=129628869 for DFSClient_NONMAPREDUCE_467616914_65
2:38:20.527 PM  DEBUG   BlockPlacementPolicy
Failed to choose from local rack (location = /default); the second replica is 
not found, retry choosing ramdomly
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
 
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:784)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:694)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:601)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:561)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:464)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:395)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:270)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:142)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:158)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1715)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3505)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:694)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:219)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:507)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)

{code}
This part makes no sense at all:


{code:java}
load: 8 > 0.0{code}
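
For what it is worth, that DEBUG stack trace appears to be the retry path the 
message itself describes rather than a crash: the local-rack choice fails and a 
random choice is retried. A simplified, illustrative sketch of that flow (the 
names only approximate the BlockPlacementPolicyDefault call chain shown in the 
trace, and both failures are hard-coded here just to show the fallback):
{code:java}
// Simplified, illustrative sketch of the "retry choosing ramdomly" fallback
// suggested by the DEBUG trace above; not the actual Hadoop source.
public class LocalRackFallbackSketch {

    static class NotEnoughReplicasException extends Exception {
        NotEnoughReplicasException(String msg) { super(msg); }
    }

    /** Pretend local-rack choice; fails when every rack-local node is excluded. */
    static String chooseLocalRack(String rack) throws NotEnoughReplicasException {
        throw new NotEnoughReplicasException("no node available in " + rack);
    }

    /** Pretend cluster-wide random choice used as the fallback. */
    static String chooseRandom() throws NotEnoughReplicasException {
        throw new NotEnoughReplicasException("no node available in the cluster");
    }

    static String chooseSecondReplica(String rack) {
        try {
            return chooseLocalRack(rack);
        } catch (NotEnoughReplicasException e) {
            // This is the point where the NameNode logs "Failed to choose from
            // local rack ...; the second replica is not found, retry choosing ramdomly".
            System.out.println("Failed to choose from local rack (location = "
                + rack + "); retrying randomly: " + e.getMessage());
            try {
                return chooseRandom();
            } catch (NotEnoughReplicasException e2) {
                return null;  // leads to "Not enough replicas was chosen"
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(chooseSecondReplica("/default"));
    }
}
{code}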



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)