[
https://issues.apache.org/jira/browse/HDFS-4376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677490#comment-13677490
]
Kihwal Lee commented on HDFS-4376:
----------------------------------
It doesn't look like a test issue. In my test run, it decided to move 90 B, but
only 80 B were moved.
Two blocks were to be moved from 32932 to 49691, but only one was moved.
{panel}
2013-06-06 14:09:16,683 INFO balancer.Balancer (Balancer.java:run(1429)) -
namenodes = [hdfs://localhost:59795]
2013-06-06 14:09:16,684 INFO balancer.Balancer (Balancer.java:run(1430)) - p
= Balancer.Parameters[BalancingPolicy.Node, threshold=10.0]
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move
Bytes Being Moved
2013-06-06 14:09:16,729 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(6143)) - allowed=true ugi=kihwal (auth:SIMPLE)
ip=/127.0.0.1 cmd=create src=/system/balancer.id dst=null
perm=kihwal:supergroup:rw-r--r--
2013-06-06 14:09:16,733 INFO hdfs.StateChange
(FSNamesystem.java:saveAllocatedBlock(2656)) - BLOCK* allocateBlock:
/system/balancer.id. BP-818264213-10.74.91.177-1370545753908
blk_-7705900514265277819_10
21 blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[127.0.0.1:49691|RBW],
ReplicaUnderConstruction[127.0.0.1:34270|RBW],
ReplicaUnderConstruction[127.0.0.1:45599|RB
W]]
2013-06-06 14:09:16,735 INFO net.NetworkTopology
(NetworkTopologyWithNodeGroup.java:add(205)) - Adding a new node:
/rack0/nodegroup0/127.0.0.1:54778
2013-06-06 14:09:16,736 INFO datanode.DataNode
(DataXceiver.java:writeBlock(429)) - Receiving
BP-818264213-10.74.91.177-1370545753908:blk_-7705900514265277819_1021 src:
/127.0.0.1:34183 dest: /127.0.0.1:
49691
2013-06-06 14:09:16,736 INFO net.NetworkTopology
(NetworkTopologyWithNodeGroup.java:add(205)) - Adding a new node:
/rack1/nodegroup2/127.0.0.1:49691
2013-06-06 14:09:16,737 INFO net.NetworkTopology
(NetworkTopologyWithNodeGroup.java:add(205)) - Adding a new node:
/rack0/nodegroup0/127.0.0.1:45599
2013-06-06 14:09:16,737 WARN datanode.DataNode
(BlockReceiver.java:<init>(213)) - Could not get file descriptor for
outputstream of class class
org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset$S
imulatedOutputStream
2013-06-06 14:09:16,738 INFO net.NetworkTopology
(NetworkTopologyWithNodeGroup.java:add(205)) - Adding a new node:
/rack1/nodegroup1/127.0.0.1:34270
2013-06-06 14:09:16,739 INFO net.NetworkTopology
(NetworkTopologyWithNodeGroup.java:add(205)) - Adding a new node:
/rack1/nodegroup2/127.0.0.1:32932
2013-06-06 14:09:16,739 INFO balancer.Balancer (Balancer.java:logNodes(933)) -
1 over-utilized: [Source[127.0.0.1:45599, utilization=40.0]]
2013-06-06 14:09:16,740 INFO balancer.Balancer (Balancer.java:logNodes(933)) -
2 underutilized: [BalancerDatanode[127.0.0.1:54778, utilization=0.0],
BalancerDatanode[127.0.0.1:49691, utilization=0.0]]
2013-06-06 14:09:16,741 INFO balancer.Balancer (Balancer.java:run(1365)) -
Need to move 70 B to make the cluster balanced.
2013-06-06 14:09:16,742 INFO balancer.Balancer
(Balancer.java:matchSourceWithTargetToMove(1067)) - Decided to move 50 B bytes
from 127.0.0.1:45599 to 127.0.0.1:54778
2013-06-06 14:09:16,742 INFO datanode.DataNode
(DataXceiver.java:writeBlock(429)) - Receiving
BP-818264213-10.74.91.177-1370545753908:blk_-7705900514265277819_1021 src:
/127.0.0.1:58416 dest: /127.0.0.1:
45599
2013-06-06 14:09:16,743 WARN datanode.DataNode
(BlockReceiver.java:<init>(213)) - Could not get file descriptor for
outputstream of class class
org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset$S
imulatedOutputStream
2013-06-06 14:09:16,742 INFO balancer.Balancer
(Balancer.java:matchSourceWithTargetToMove(1067)) - Decided to move 20 B bytes
from 127.0.0.1:32932 to 127.0.0.1:49691
2013-06-06 14:09:16,744 INFO balancer.Balancer
(Balancer.java:matchSourceWithTargetToMove(1067)) - Decided to move 20 B bytes
from 127.0.0.1:34270 to 127.0.0.1:49691
2013-06-06 14:09:16,744 INFO balancer.Balancer (Balancer.java:run(1379)) -
Will move 90 B in this iteration
...
0 B 70 B 90 B
...
2013-06-06 14:09:16,790 INFO balancer.Balancer (Balancer.java:dispatch(359)) -
Moving block 4026872562939698188 from 127.0.0.1:45599 to 127.0.0.1:54778
through 127.0.0.1:45599 is succeeded.
2013-06-06 14:09:16,796 INFO balancer.Balancer (Balancer.java:dispatch(359)) -
Moving block 7808894185501941486 from 127.0.0.1:45599 to 127.0.0.1:54778
through 127.0.0.1:45599 is succeeded.
2013-06-06 14:09:16,796 INFO balancer.Balancer (Balancer.java:dispatch(359)) -
Moving block 15281409689851962 from 127.0.0.1:45599 to 127.0.0.1:54778 through
127.0.0.1:32932 is succeeded.
2013-06-06 14:09:16,797 INFO balancer.Balancer (Balancer.java:dispatch(359)) -
Moving block 116703018151698955 from 127.0.0.1:34270 to 127.0.0.1:49691 through
127.0.0.1:34270 is succeeded.
2013-06-06 14:09:16,793 INFO balancer.Balancer (Balancer.java:dispatch(359)) -
Moving block 2721708609550631379 from 127.0.0.1:45599 to 127.0.0.1:54778
through 127.0.0.1:45599 is succeeded.
2013-06-06 14:09:16,799 INFO balancer.Balancer (Balancer.java:dispatch(359)) -
Moving block -7705900514265277819 from 127.0.0.1:45599 to 127.0.0.1:54778
through 127.0.0.1:45599 is succeeded.
2013-06-06 14:09:16,801 INFO balancer.Balancer (Balancer.java:dispatch(359)) -
Moving block 3682833351233298286 from 127.0.0.1:34270 to 127.0.0.1:49691
through 127.0.0.1:34270 is succeeded.
2013-06-06 14:09:16,805 INFO balancer.Balancer (Balancer.java:dispatch(359)) -
Moving block -7496839095912394308 from 127.0.0.1:32932 to 127.0.0.1:49691
through 127.0.0.1:32932 is succeeded.
{panel}
If I enable debug logging in Balancer, it fails less frequently or fails
differently. In the following case, all moves were complete, but the final
utilization was wrong. The test itself didn't timeout.
{panel}
java.util.concurrent.TimeoutException: Rebalancing expected avg utilization to
become 0.16, but on datanode 127.0.0.1:39755 it remains at 0.3 after more than
20000 msec.
at
org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup.waitForBalancer(TestBalancerWithNodeGroup.java:149)
at
org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup.runBalancer(TestBalancerWithNodeGroup.java:176)
at
org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup.testBalancerWithNodeGroup(TestBalancerWithNodeGroup.java:300)
{panel}
> Intermittent timeout of TestBalancerWithNodeGroup
> -------------------------------------------------
>
> Key: HDFS-4376
> URL: https://issues.apache.org/jira/browse/HDFS-4376
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer, test
> Affects Versions: 2.0.3-alpha
> Reporter: Aaron T. Myers
> Assignee: Junping Du
> Priority: Minor
> Attachments: test-balancer-with-node-group-timeout.txt
>
>
> HDFS-4261 fixed several issues with the balancer and balancer tests, and
> reduced the frequency with which TestBalancerWithNodeGroup times out. Despite
> this, occasional timeouts still occur in this test. This JIRA is to track and
> fix this problem.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira