[ 
https://issues.apache.org/jira/browse/HDFS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308105#comment-14308105
 ] 

Andrew Wang commented on HDFS-7411:
-----------------------------------

I had an offline request to summarize some of the above.

Nicholas's compatibility concern is with the rate limiting of decommissioning. 
Currently, this limit is expressed as a number of nodes to process per decom 
manager wakeup. This scheme has a number of flaws (sketched below):

* Since the decom manager iterates over the whole datanode list, both live and 
decommissioning nodes count towards the limit. Thus, the actual number of 
decommissioning nodes processed per wakeup varies between 0 and the limit.
* Since datanodes have different numbers of blocks, the amount of actual work 
per node varies as well.
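
To make the first flaw concrete, here is a minimal sketch of a per-wakeup limit 
counted in nodes examined rather than in decommissioning work done. This is 
purely illustrative, not the actual DatanodeManager logic; all names are 
invented:

{code:java}
import java.util.List;

/** Hypothetical sketch of the old node-based limit, not the real HDFS code. */
class OldStyleDecomSweep {
  enum State { LIVE, DECOMMISSIONING }

  static class Node {
    final State state;
    Node(State state) { this.state = state; }
  }

  /** One wakeup: examine at most nodesPerWakeup entries of the full node list. */
  static int sweep(List<Node> allNodes, int nodesPerWakeup) {
    int examined = 0;
    int decomChecked = 0;
    for (Node n : allNodes) {
      if (examined++ >= nodesPerWakeup) {
        break;                              // budget spent on the nodes seen so far
      }
      if (n.state == State.DECOMMISSIONING) {
        decomChecked++;                     // only these represent real decom work
      }
    }
    return decomChecked;                    // anywhere from 0 up to nodesPerWakeup
  }

  public static void main(String[] args) {
    // Two live nodes sort ahead of the two decommissioning nodes, so a limit
    // of 2 does no decommissioning work at all this wakeup.
    List<Node> cluster = List.of(
        new Node(State.LIVE), new Node(State.LIVE),
        new Node(State.DECOMMISSIONING), new Node(State.DECOMMISSIONING));
    System.out.println("decom nodes checked: " + sweep(cluster, 2));   // prints 0
  }
}
{code}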

This means:
* This config parameter only very loosely corresponds to decom rate and decom 
pause times, which are the two things that admins care about.
* Trying to tune decom behavior with this parameter is thus somewhat futile.
* In the grand scheme of HDFS, this is also not a commonly tweaked parameter.

Because of this, we felt it was okay to change the interpretation of this config 
option. I view the old behavior more as a bug than as something users actually 
depend on.

Translating this number-of-nodes limit into a number-of-blocks limit (as done in 
the current patch) makes the config far more predictable and thus more usable. 
Since the new code also supports incremental scans (which is what makes it 
faster), specifying the limit as a number of nodes doesn't make much sense.
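
For illustration, a simplified sketch of what a block budget with incremental 
scans looks like. This is not the code in the patch; the classes and names are 
invented:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified sketch of a block-count budget with incremental scans; the real
 *  DecommissionManager in the patch is more involved. */
class BlockBudgetSweep {
  static class DecomNode {
    final String name;
    long blocksLeftToCheck;                 // blocks not yet verified as replicated
    DecomNode(String name, long blocks) {
      this.name = name;
      this.blocksLeftToCheck = blocks;
    }
  }

  private final long blocksPerWakeup;
  private final Deque<DecomNode> pending = new ArrayDeque<>();

  BlockBudgetSweep(long blocksPerWakeup) { this.blocksPerWakeup = blocksPerWakeup; }

  void startDecommission(DecomNode n) { pending.add(n); }

  /** One wakeup: check at most blocksPerWakeup blocks in total, picking up a
   *  partially scanned node where the previous wakeup stopped. */
  void wakeup() {
    long budget = blocksPerWakeup;
    while (budget > 0 && !pending.isEmpty()) {
      DecomNode n = pending.peek();
      long checked = Math.min(budget, n.blocksLeftToCheck);
      n.blocksLeftToCheck -= checked;       // "scan" this many of the node's blocks
      budget -= checked;
      if (n.blocksLeftToCheck == 0) {
        pending.poll();                     // fully scanned; later wakeups skip it
      }
    }
  }

  public static void main(String[] args) {
    BlockBudgetSweep sweep = new BlockBudgetSweep(500_000);
    sweep.startDecommission(new DecomNode("dn1", 1_200_000));
    sweep.wakeup();   // checks 500k of dn1's blocks
    sweep.wakeup();   // checks another 500k, resuming where it left off
    sweep.wakeup();   // checks the remaining 200k; dn1 is fully scanned
  }
}
{code}

The point is that each wakeup does a bounded, predictable amount of block work 
regardless of how many nodes are live or how blocks are distributed across them.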

The only potential surprise I see for cluster operators is if the translation 
of the limit from {{# nodes}} to {{# blocks}} is too liberal. This would result 
in longer maximum pause times than before. We thought 100k blocks per node was 
a conservative estimate, but this could be reduced further.
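
As a back-of-the-envelope sketch of that translation (the 100k figure is the 
estimate mentioned above; the class, method, and constant names are invented 
for illustration):

{code:java}
/** Sketch of translating the old node-based limit into a block budget.
 *  The factor is the rough estimate discussed above; names are invented. */
class DecomLimitTranslation {
  // Assumed blocks represented by one "node" of the old limit; lowering this
  // makes the translated limit, and hence the maximum pause time, smaller.
  static final long ASSUMED_BLOCKS_PER_NODE = 100_000L;

  static long nodesLimitToBlocksLimit(int oldNodesPerWakeup) {
    return oldNodesPerWakeup * ASSUMED_BLOCKS_PER_NODE;
  }

  public static void main(String[] args) {
    // An old setting of 5 nodes per wakeup becomes a 500,000-block budget.
    System.out.println(nodesLimitToBlocksLimit(5));
  }
}
{code}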

One avenue I do not want to pursue is keeping the old code around, as Nicholas 
has proposed. This increases our maintenance burden, and means many people will 
keep running into the same issues surrounding decom.

If Nicholas still does not agree with the above rationale, I see the following 
potential options for improvement:

* Be even more conservative with the translation factor, e.g. assume only 50k 
blocks per node
* Factor the number of nodes and/or the average blocks per node into the 
translation (see the sketch after this list). This would better approximate the 
old average pause times.
* Make the new decom manager also support a {{# nodes}} limit. This isn't great 
since scans are incremental now, but it means we'll be doing strictly less work 
per pause than before.
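
For the second option, a hypothetical sketch of deriving the factor from 
cluster-wide statistics rather than a fixed constant (all names invented; not 
part of any patch):

{code:java}
/** Hypothetical sketch of the second option: derive the translation factor
 *  from the cluster's actual average blocks per node instead of a constant. */
class AdaptiveDecomLimitTranslation {
  static long nodesLimitToBlocksLimit(int oldNodesPerWakeup,
                                      long totalBlocks, int numLiveNodes) {
    long avgBlocksPerNode = numLiveNodes > 0 ? totalBlocks / numLiveNodes : 0;
    return oldNodesPerWakeup * avgBlocksPerNode;
  }

  public static void main(String[] args) {
    // 5 nodes per wakeup on a cluster averaging 80k blocks/node -> 400k blocks.
    System.out.println(nodesLimitToBlocksLimit(5, 8_000_000L, 100));
  }
}
{code}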

> Refactor and improve decommissioning logic into DecommissionManager
> -------------------------------------------------------------------
>
>                 Key: HDFS-7411
>                 URL: https://issues.apache.org/jira/browse/HDFS-7411
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.5.1
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-7411.001.patch, hdfs-7411.002.patch, 
> hdfs-7411.003.patch, hdfs-7411.004.patch, hdfs-7411.005.patch, 
> hdfs-7411.006.patch, hdfs-7411.007.patch, hdfs-7411.008.patch, 
> hdfs-7411.009.patch, hdfs-7411.010.patch
>
>
> Would be nice to split out decommission logic from DatanodeManager to 
> DecommissionManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
