[ https://issues.apache.org/jira/browse/HDFS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277593#comment-14277593 ]
Colin Patrick McCabe commented on HDFS-7411:
--------------------------------------------

{{DecommissionManager}}: Let's document the locking here. I believe all of these functions are designed to be run with the FSNamesystem write lock held, correct?

A question about the data structures here. We already have a way of iterating through all the blocks on a given {{DataNode}} via the implicit linked lists in the {{BlockInfo}} objects. Shouldn't we be using that here, rather than creating our own list in {{decomNodeBlocks}}? This could save a lot of memory, since datanodes might have anywhere between 200k and a million blocks in the next few years. (I am fine with doing this in a follow-on, of course.)

{{DecommissionManager#startDecommission}}: let's have a debug log message here for the case where the node is already decommissioned... it might help with debugging. Similarly in {{stopDecommission}}... if we're not doing anything, let's log why we're not doing anything at debug or trace level.

Config keys: We talked about this a bit earlier but didn't really come to any consensus. I think we should just get rid of {{dfs.namenode.decommission.nodes.per.interval}} and {{dfs.namenode.decommission.blocks.per.node}}, and just have a configuration key like {{dfs.namenode.decommission.blocks.per.minute}} that expresses directly what we want. If we set a reasonable default for {{dfs.namenode.decommission.blocks.per.minute}}, the impact on users will be minimal. The old rate-limiting process was broken anyway... that's a big part of what this patch is designed to fix, as I understand it. So we shouldn't need to lug around these old config keys that don't really express what we're trying to configure. Let's just add a new configuration key that is easy to understand and maintain in the future. Decom is a manual process anyway, where admins get involved.

{{dfs.namenode.decommission.max.concurrent.tracked.nodes}}: I have mixed feelings about this configuration key.
It seems like the reason you want to add it is to limit NN memory consumption (but you can do that by not duplicating data structures -- see above). However, it may have some value for users who would rather finish decommissioning 100 nodes in an hour than have 1000 nodes 10% decommissioned in that time. So I guess it is a good addition, maybe? The only problem is that if some of those first 100 nodes get stuck because there is an open-for-write file or something, then the decom process will start to slow down and perhaps eventually hang.

On a related note, I feel like we should have a follow-up change to make the decom parameters reconfigurable via the configuration reloading interface we added recently. I will file a follow-on JIRA for that.

{code}
if (LOG.isDebugEnabled()) {
  StringBuilder b = new StringBuilder("Node {} ");
  if (isHealthy) {
    b.append("is ");
  } else {
    b.append("isn't ");
  }
  b.append("and still needs to replicate {} more blocks, "
      + "decommissioning is still in progress.");
{code}

Missing "healthy " in the printout. Can we log this at info level every half hour or something, so that people can see issues with nodes getting "stuck"? As it is, it seems like they'll get no output unless they monkey with log4j manually.

{code}
 * Note also that the reference to the list of under-replicated blocks
 * will be null on initial addx
{code}

Spelling?
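For the "log at info every half hour" suggestion above, one simple shape is a counter in the decommission monitor that promotes the message to INFO every Nth pass and leaves it at DEBUG otherwise. The following is a minimal self-contained sketch of that idea only; the class name, field names, and interval are illustrative, not taken from the HDFS-7411 patch:

```java
// Sketch of rate-limited log promotion: report at INFO every Nth pass
// through the monitor, DEBUG otherwise. Names are illustrative, not
// from the patch.
class StuckNodeLogger {
    private final int infoEvery;   // monitor passes between INFO messages
    private long passes;           // total passes so far

    StuckNodeLogger(int infoEvery) {
        this.infoEvery = infoEvery;
    }

    /** Returns the level the current pass should log at. */
    String levelForThisPass() {
        return (passes++ % infoEvery == 0) ? "INFO" : "DEBUG";
    }
}
```

With, say, a 30-second monitor interval, {{infoEvery = 60}} would surface a stuck node at INFO roughly every half hour without any log4j changes.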
{code}
final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>> it =
    new CyclicIteration<DatanodeDescriptor, AbstractList<BlockInfo>>(
        decomNodeBlocks, iterkey).iterator();
{code}

Well, this is creative, but I think I'd rather have the standard indentation :)

> Refactor and improve decommissioning logic into DecommissionManager
> -------------------------------------------------------------------
>
>              Key: HDFS-7411
>              URL: https://issues.apache.org/jira/browse/HDFS-7411
>          Project: Hadoop HDFS
>       Issue Type: Improvement
> Affects Versions: 2.5.1
>         Reporter: Andrew Wang
>         Assignee: Andrew Wang
>      Attachments: hdfs-7411.001.patch, hdfs-7411.002.patch,
> hdfs-7411.003.patch, hdfs-7411.004.patch, hdfs-7411.005.patch,
> hdfs-7411.006.patch
>
> Would be nice to split out decommission logic from DatanodeManager to
> DecommissionManager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)