[jira] [Commented] (HDFS-10808) DiskBalancer does not execute multi-steps plan-redux

Anu Engineer (JIRA) Tue, 30 Aug 2016 18:15:41 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450677#comment-15450677
 ]


Anu Engineer commented on HDFS-10808:
-------------------------------------

[~eddyxu] Thank you very much for taking time out to look at the code and ask 
me these pertinent questions.

bq. Wouldn't break is the simplest way to exit the while loop? 
if you had one or two breaks may be. But as the number of breaks increase, the 
control flow graph becomes quite complex. So it becomes harder to reason about 
the resources and other states from each exit point.  An easy way of thinking 
about copyBlocks would be to think of it as a simple state machine. if you were 
implementing a state machine, you would probably move to state "exit" and use 
the default mechanisms of the state machine to handle the exit state instead of 
breaking out at each error condition. copyBlocks is following that pattern. So 
while in generic case I agree with you, in this specific case I think this 
pattern produces code that is easier to reason.

bq.  on the comment.
{noformat}
// check if someone told us exit, treat this as an interruption
// point for the thread, since both getNextBlock and moveBlocAcrossVolume
// can take some time.
{noformat}
I was under the impression that comment is quite clear, may be I am mistaken.

There are several conditions under which we would like to exit in the copy 
blocks thread. Some of them are states. Some are actions with clear side 
effects. What we are trying to do is minimize the effects of both. So we 
introduce the notion of "interruption points" in our copy thread. That is when 
we invoke a function and if we encounter a failure condition, we flag that 
information so that at the next safe point to bail, we will. In other words, we 
don't exit at the point of error, but simply set the state so that thread can 
proceed to a point where it considers that it is safe for it to exit.

Examples of action with side effects are, copy of data block but metadata is 
not still copied or getting a bunch of disk errors (we wait till 5) before we 
can get out etc, or finding a block and before we can get to it, it disappears 
underneath us. Since we have all these kinds of external conditions to take 
care of, we simply set up a flag telling the system to exit cleanly. This 
paradigm gives us a centralized exit handler, so if the thread had to do some 
specific cleanup based on certain error, it is still possible to chain those 
error handlers at the exit point.

Yes, the Atomic nature of shouldRun flag is confusing and perhaps not needed. 
It is an artifact of playing around with copying multiple blocks when I was 
developing code. It had a different structure, but then I found that enforcing 
bandwidth was harder and decided to do single block copy at a time.

I really appreciate you taking time out to ask these questions and helping to 
make sure that I am in the right path.


 



> DiskBalancer does not execute multi-steps plan-redux
> ----------------------------------------------------
>
>                 Key: HDFS-10808
>                 URL: https://issues.apache.org/jira/browse/HDFS-10808
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: balancer & mover
>            Reporter: Anu Engineer
>            Assignee: Anu Engineer
>         Attachments: HDFS-10808.001.patch, HDFS-10808.002.patch
>
>
> This is to redo of the fix in HDFS-10598



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HDFS-10808) DiskBalancer does not execute multi-steps plan-redux

Reply via email to