[
https://issues.apache.org/jira/browse/HDFS-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058496#comment-14058496
]
Rafal Wojdyla commented on HDFS-6621:
-------------------------------------
We have also experienced this problem with the Balancer.
In general, the balancer finishes an iteration prematurely because
noPendingBlockIteration >= 5.
I was about to create a JIRA ticket for this, but then I noticed this one.
The solutions we have applied are:
1. Set noPendingBlockIteration = 0 when pendingBlock != null, exactly the way you
did.
2. Notify only on the source object when a block transfer finishes.
Problem/Solution 1 is well described in the issue description.
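To make the effect of problem 1 concrete, here is a minimal standalone sketch of the counter logic only. It is not the Balancer code: the class NoPendingCounterSketch, the method blocksScheduled, and the boolean "attempts" array are hypothetical names used for illustration. Only the limit of 5 comes from the ticket. It replays a fixed sequence of scheduling attempts and counts how many blocks get scheduled with and without the reset.

```java
final class NoPendingCounterSketch {
    // Same limit as described in the ticket.
    static final int MAX_NO_PENDING_BLOCK_ITERATIONS = 5;

    /**
     * Replays scheduling attempts (true = a pending block was found,
     * false = no pending block) and returns how many blocks were scheduled
     * before the loop gave up. This is a simplified model, not Balancer code.
     */
    static int blocksScheduled(boolean[] attempts, boolean resetOnSuccess) {
        int noPendingBlockIteration = 0;
        int scheduled = 0;
        for (boolean found : attempts) {
            if (found) {
                if (resetOnSuccess) {
                    noPendingBlockIteration = 0; // the proposed fix
                }
                scheduled++;
                continue;
            }
            noPendingBlockIteration++;
            if (noPendingBlockIteration >= MAX_NO_PENDING_BLOCK_ITERATIONS) {
                break; // iteration ends early
            }
        }
        return scheduled;
    }

    public static void main(String[] args) {
        // Every fourth attempt finds no pending block; the rest succeed.
        boolean[] attempts = new boolean[40];
        for (int i = 0; i < attempts.length; i++) {
            attempts[i] = (i % 4 != 3);
        }
        System.out.println("without reset: " + blocksScheduled(attempts, false));
        System.out.println("with reset:    " + blocksScheduled(attempts, true));
    }
}
```

Without the reset, the misses accumulate even though they are never consecutive, so the loop stops partway through the sequence. With the reset, only 5 misses in a row would end it.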
Problem/Solution 2:
In org/apache/hadoop/hdfs/server/balancer/Balancer:
{code}
private void dispatch() {
  ...
  synchronized (Balancer.this) {
    Balancer.this.notifyAll();
  }
}
{code}
This notifies all scheduling threads, even those that are waiting while all 5
of their transfer threads are still occupied.
When such a thread wakes up, it tries to get the next block to move, but because
all 5 transfer threads are occupied
it gets null as the next block to move. That increments
noPendingBlockIteration, and we are back in problem 1.
The solution is to notify only the threads waiting on the source object, and to
reset the PendingBlockMove object afterwards.
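The idea of waiting and notifying per source can be sketched as follows. This is not the patch itself: Source, awaitFreeSlot, transferFinished, and the other names are hypothetical stand-ins, and the PendingBlockMove reset is not modelled. The sketch only shows that a notification on one source's monitor does not wake a dispatcher waiting on another source, unlike notifyAll() on a shared Balancer monitor.

```java
final class SourceNotifySketch {

    /** Hypothetical stand-in for a source with a bounded number of transfer slots. */
    static final class Source {
        private int freeSlots = 0; // all transfer threads start out occupied

        /** Dispatcher side: block until this source has a free slot, then take it. */
        synchronized void awaitFreeSlot() throws InterruptedException {
            while (freeSlots == 0) {
                wait(); // waits on this Source only, not on a shared monitor
            }
            freeSlots--;
        }

        /** Transfer side: release a slot and wake only threads waiting on this source. */
        synchronized void transferFinished() {
            freeSlots++;
            notifyAll();
        }
    }

    /** Starts a daemon thread that waits for a free slot on the given source. */
    static Thread startDispatcher(Source source) {
        Thread t = new Thread(() -> {
            try {
                source.awaitFreeSlot();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Returns true if finishing a transfer on one source releases only its own dispatcher. */
    static boolean onlyNotifiedSourceWakes() {
        Source first = new Source();
        Source second = new Source();
        Thread firstDispatcher = startDispatcher(first);
        Thread secondDispatcher = startDispatcher(second);
        try {
            first.transferFinished(); // a transfer completes on "first" only
            firstDispatcher.join(5000);
            boolean firstReleased = !firstDispatcher.isAlive();
            boolean secondStillWaiting = secondDispatcher.isAlive();

            second.transferFinished(); // now let the other dispatcher proceed
            secondDispatcher.join(5000);
            return firstReleased && secondStillWaiting && !secondDispatcher.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("only notified source woke: " + onlyNotifiedSourceWakes());
    }
}
```

Because the second dispatcher is never woken while its slots are occupied, it never makes a failed attempt to pick a block, so in the real code it would not increment noPendingBlockIteration for that reason.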
Should I provide a patch in this ticket, or create a separate ticket?
> Hadoop Balancer prematurely exits iterations
> --------------------------------------------
>
> Key: HDFS-6621
> URL: https://issues.apache.org/jira/browse/HDFS-6621
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer
> Affects Versions: 2.2.0, 2.4.0
> Environment: Red Hat Enterprise Linux Server release 5.8 with Hadoop
> 2.4.0
> Reporter: Benjamin Bowman
> Labels: balancer
> Attachments: HDFS-6621.patch
>
>
> I have been having an issue with balancing being too slow. The issue was
> not the speed at which blocks were moved, but rather that the balancer
> would prematurely exit its balancing iterations. It would move ~10
> blocks or 100 MB and then exit the current iteration (in which it said it was
> planning on moving about 10 GB).
> I looked in the Balancer.java code and believe I found and solved the issue.
> In the dispatchBlocks() function there is a variable,
> "noPendingBlockIteration", which counts the number of iterations in which a
> pending block to move cannot be found. Once this number gets to 5, the
> balancer exits the overall balancing iteration. I believe the desired
> functionality is 5 consecutive no pending block iterations - however this
> variable is never reset to 0 upon block moves. So once this number reaches 5
> - even if there have been thousands of blocks moved in between these no
> pending block iterations - the overall balancing iteration will prematurely
> end.
> The fix I applied was to set noPendingBlockIteration = 0 when a pending block
> is found and scheduled. In this way, my iterations do not prematurely exit
> unless there are 5 consecutive no pending block iterations. Below is a copy
> of my dispatchBlocks() function with the change I made.
> private void dispatchBlocks() {
>   long startTime = Time.now();
>   long scheduledSize = getScheduledSize();
>   this.blocksToReceive = 2*scheduledSize;
>   boolean isTimeUp = false;
>   int noPendingBlockIteration = 0;
>   while(!isTimeUp && getScheduledSize()>0 &&
>       (!srcBlockList.isEmpty() || blocksToReceive>0)) {
>     PendingBlockMove pendingBlock = chooseNextBlockToMove();
>     if (pendingBlock != null) {
>       // a block was found, so the no-pending streak is broken
>       noPendingBlockIteration = 0;
>       // move the block
>       pendingBlock.scheduleBlockMove();
>       continue;
>     }
>     /* Since we can not schedule any block to move,
>      * filter any moved blocks from the source block list and
>      * check if we should fetch more blocks from the namenode
>      */
>     filterMovedBlocks(); // filter already moved blocks
>     if (shouldFetchMoreBlocks()) {
>       // fetch new blocks
>       try {
>         blocksToReceive -= getBlockList();
>         continue;
>       } catch (IOException e) {
>         LOG.warn("Exception while getting block list", e);
>         return;
>       }
>     } else {
>       // source node cannot find a pendingBlockToMove, iteration +1
>       noPendingBlockIteration++;
>       // in case no blocks can be moved for source node's task,
>       // jump out of while-loop after 5 iterations.
>       if (noPendingBlockIteration >= MAX_NO_PENDING_BLOCK_ITERATIONS) {
>         setScheduledSize(0);
>       }
>     }
>     // check if time is up or not
>     if (Time.now()-startTime > MAX_ITERATION_TIME) {
>       isTimeUp = true;
>       continue;
>     }
>     /* Now we can not schedule any block to move and there are
>      * no new blocks added to the source block list, so we wait.
>      */
>     try {
>       synchronized(Balancer.this) {
>         Balancer.this.wait(1000); // wait for targets/sources to be idle
>       }
>     } catch (InterruptedException ignored) {
>     }
>   }
> }