[ https://issues.apache.org/jira/browse/HDFS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308105#comment-14308105 ]
Andrew Wang commented on HDFS-7411:
-----------------------------------

I had an offline request to summarize some of the above. Nicholas's compatibility concern regards the rate limiting of decommissioning. Currently, this is expressed as a number of nodes to process per decom manager wakeup. There are a number of flaws with this scheme:
* Since the decom manager iterates over the whole datanode list, both live and decommissioning nodes count towards the limit. Thus, the actual number of decommissioning nodes processed varies between 0 and the limit.
* Since datanodes have different numbers of blocks, the amount of actual work per node can vary as well.

This means:
* This config parameter only very loosely corresponds to decom rate and decom pause times, which are the two things that admins care about.
* Trying to tune decom behavior with this parameter is thus somewhat futile.
* In the grand scope of HDFS, this is also not a commonly tweaked parameter.

Because of this, we felt it was okay to change the interpretation of this config option. I view the old behavior more as a bug than something being depended upon by users. Translating this number-of-nodes limit into a number-of-blocks limit (as done in the current patch) makes the config far more predictable and thus usable. Since the new code also supports incremental scans (which is what makes it faster), specifying the limit as a number of nodes doesn't make much sense.

The only potential surprise I see for cluster operators is if the translation of the limit from {{# nodes}} to {{# blocks}} is too liberal. This would result in longer maximum pause times than before. We thought 100k blocks per node was a conservative estimate, but this could be further reduced.

One avenue I do not want to pursue is keeping the old code around, as Nicholas has proposed. This would increase our maintenance burden, and it means many people will keep running into the same issues surrounding decom.
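To make the translation concrete, here is a minimal sketch of the {{# nodes}} to {{# blocks}} conversion discussed above. The class and method names, and treating the 100k estimate as a named constant, are illustrative assumptions, not code from the actual patch:

```java
// Illustrative sketch only; names and structure are hypothetical,
// not taken from the HDFS-7411 patch.
public class DecomLimitTranslation {

    // Assumed conservative estimate of blocks per datanode,
    // per the 100k figure in the discussion above.
    static final int BLOCKS_PER_NODE_ESTIMATE = 100_000;

    /**
     * Translate the legacy per-wakeup node limit into a per-wakeup
     * block limit for the new incremental decom manager.
     */
    static int toBlockLimit(int nodesPerInterval) {
        return nodesPerInterval * BLOCKS_PER_NODE_ESTIMATE;
    }

    public static void main(String[] args) {
        // e.g. a legacy limit of 5 nodes per wakeup becomes 500,000 blocks
        System.out.println(toBlockLimit(5));
    }
}
```

Under this scheme, lowering {{BLOCKS_PER_NODE_ESTIMATE}} (e.g. to 50k) directly shortens the maximum work done per wakeup, which is the knob the options below adjust.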
If Nicholas still does not agree with the above rationale, I see the following potential options for improvement:
* Be even more conservative with the translation factor, e.g. assume only 50k blocks per node.
* Factor the number of nodes and/or average blocks per node into the translation. This will better approximate the old average pause times.
* Make the new decom manager also support a {{# nodes}} limit. This isn't great since scans are incremental now, but it means we'll be doing strictly less work per pause than before.

> Refactor and improve decommissioning logic into DecommissionManager
> -------------------------------------------------------------------
>
>                 Key: HDFS-7411
>                 URL: https://issues.apache.org/jira/browse/HDFS-7411
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.5.1
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-7411.001.patch, hdfs-7411.002.patch, hdfs-7411.003.patch, hdfs-7411.004.patch, hdfs-7411.005.patch, hdfs-7411.006.patch, hdfs-7411.007.patch, hdfs-7411.008.patch, hdfs-7411.009.patch, hdfs-7411.010.patch
>
> Would be nice to split out decommission logic from DatanodeManager to DecommissionManager.