[
https://issues.apache.org/jira/browse/HDFS-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiao Chen updated HDFS-10320:
-----------------------------
Labels: supportability (was: )
> Rack failures may result in NN terminate
> ----------------------------------------
>
> Key: HDFS-10320
> URL: https://issues.apache.org/jira/browse/HDFS-10320
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Xiao Chen
> Assignee: Xiao Chen
> Labels: supportability
> Fix For: 2.8.0
>
> Attachments: HDFS-10320.01.patch, HDFS-10320.02.patch,
> HDFS-10320.03.patch, HDFS-10320.04.patch, HDFS-10320.05.patch,
> HDFS-10320.06.patch
>
>
> If rack failures leave only one rack available,
> {{BlockPlacementPolicyDefault#chooseRandom}} may get an
> {{InvalidTopologyException}} when calling {{NetworkTopology#chooseRandom}}.
> The exception then propagates all the way out to {{BlockManager}}'s
> {{ReplicationMonitor}} thread and terminates the NN.
> Log:
> {noformat}
> 2016-02-24 09:22:01,514 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> 2016-02-24 09:22:01,958 ERROR org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception.
> org.apache.hadoop.net.NetworkTopology$InvalidTopologyException: Failed to find datanode (scope="" excludedScope="/rack_a5").
>   at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:729)
>   at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:694)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:635)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:580)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:348)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:214)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:111)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:3746)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$200(BlockManager.java:3711)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1400)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1306)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3682)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3634)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
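
The failure mode described above can be illustrated with a small, self-contained sketch. This is not the HDFS-10320 patch and does not use any Hadoop classes: `ToyTopology`, `chooseRandomExcluding`, `chooseRemoteOrNull`, and the local `InvalidTopologyException` are hypothetical stand-ins invented for illustration. It only shows the general idea that an unchecked exception raised when no node exists outside the excluded rack can be handled at the placement layer, so that it never reaches the thread driving replication work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RackFailureSketch {

    /** Hypothetical stand-in for an unchecked topology exception. */
    static class InvalidTopologyException extends RuntimeException {
        InvalidTopologyException(String msg) { super(msg); }
    }

    /** Toy topology: a flat list of {rack, node} pairs. Not the Hadoop class. */
    static class ToyTopology {
        private final List<String[]> nodes = new ArrayList<>();
        private final Random random = new Random();

        void add(String rack, String node) {
            nodes.add(new String[] {rack, node});
        }

        /** Picks a random node outside excludedRack; throws if none exists. */
        String chooseRandomExcluding(String excludedRack) {
            List<String> candidates = new ArrayList<>();
            for (String[] n : nodes) {
                if (!n[0].equals(excludedRack)) {
                    candidates.add(n[1]);
                }
            }
            if (candidates.isEmpty()) {
                throw new InvalidTopologyException(
                    "Failed to find datanode (scope=\"\" excludedScope=\""
                        + excludedRack + "\").");
            }
            return candidates.get(random.nextInt(candidates.size()));
        }
    }

    /** Defensive wrapper: reports "no remote target" as null instead of propagating. */
    static String chooseRemoteOrNull(ToyTopology topology, String localRack) {
        try {
            return topology.chooseRandomExcluding(localRack);
        } catch (InvalidTopologyException e) {
            // The caller can treat null as "not enough replicas placed" and retry later.
            return null;
        }
    }

    public static void main(String[] args) {
        ToyTopology topology = new ToyTopology();
        topology.add("/rack_a5", "dn1");
        topology.add("/rack_a5", "dn2");
        // Only one rack is left, so excluding it leaves no candidates.
        String target = chooseRemoteOrNull(topology, "/rack_a5");
        System.out.println(target == null
            ? "no remote-rack target; caller keeps running"
            : "chose " + target);
    }
}
```

Whether the real fix catches the exception, avoids throwing it, or takes another route is not stated in this message; the attached patches are the authoritative source.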