[ https://issues.apache.org/jira/browse/HDFS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277593#comment-14277593 ]
Colin Patrick McCabe commented on HDFS-7411:
--------------------------------------------

{{DecommissionManager}}: Let's document the locking here. I believe all of these functions are designed to be run with the FSNamesystem write lock held, correct?

A question about the data structures here. We already have a way of iterating through all the blocks on a given {{DataNode}} via the implicit linked lists in the {{BlockInfo}} objects. Shouldn't we be using that here, rather than creating our own list in {{decomNodeBlocks}}? This could save a lot of memory, since datanodes might have anywhere between 200k and a million blocks in the next few years. (I am fine with doing this in a follow-on, of course.)

{{DecommissionManager#startDecommission}}: let's have a debug log message here for the case where the node is already decommissioned... it might help with debugging. Similarly in {{stopDecommission}}... if we're not doing anything, let's log why we're not doing anything at debug or trace level.

Config keys: We talked about this a bit earlier but didn't really come to any consensus. I think we should just get rid of {{dfs.namenode.decommission.nodes.per.interval}} and {{dfs.namenode.decommission.blocks.per.node}}, and just have a configuration key like {{dfs.namenode.decommission.blocks.per.minute}} that expresses directly what we want. If we set a reasonable default for {{dfs.namenode.decommission.blocks.per.minute}}, the impact on users will be minimal. The old rate-limiting process was broken anyway... that's a big part of what this patch is designed to fix, as I understand it. So we shouldn't need to lug around these old config keys that don't really express what we're trying to configure. Let's just add a new configuration key that is easy to understand and maintain in the future. Decom is a manual process anyway, where admins get involved.

{{dfs.namenode.decommission.max.concurrent.tracked.nodes}}: I have mixed feelings about this configuration key.
It seems like the reason you want to add it is to limit NN memory consumption (but you can do that by not duplicating data structures -- see above). However, it may have some value for users who would rather finish decommissioning 100 nodes in an hour than have 1000 nodes 10% decommissioned in that time. So I guess it is a good addition, maybe? The only problem is that if some of those first 100 nodes get stuck because there is an open-for-write file or something, then the decom process will start to slow down and perhaps eventually hang.

On a related note, I feel like we should have a follow-up change to make the decom parameters reconfigurable via the configuration reloading interface we added recently. I will file a follow-on JIRA for that.

{code}
if (LOG.isDebugEnabled()) {
  StringBuilder b = new StringBuilder("Node {} ");
  if (isHealthy) {
    b.append("is ");
  } else {
    b.append("isn't ");
  }
  b.append("and still needs to replicate {} more blocks, "
      + "decommissioning is still in progress.");
{code}

Missing "healthy " in the printout. Can we log this at info level every half hour or something, so that people can see issues with nodes getting "stuck"? As it is, it seems like they'll get no output unless they monkey with log4j manually.

{code}
 * Note also that the reference to the list of under-replicated blocks
 * will be null on initial addx
{code}

Spelling?
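For the "log at info every half hour" suggestion above, one simple shape is a counter in the decommission monitor that promotes the message to INFO every Nth pass and leaves it at DEBUG otherwise. The following is a minimal self-contained sketch of that idea only; the class name, field names, and interval are illustrative, not taken from the HDFS-7411 patch:

```java
// Sketch of rate-limited log promotion: report at INFO every Nth pass
// through the monitor, DEBUG otherwise. Names are illustrative, not
// from the patch.
class StuckNodeLogger {
    private final int infoEvery;   // monitor passes between INFO messages
    private long passes;           // total passes so far

    StuckNodeLogger(int infoEvery) {
        this.infoEvery = infoEvery;
    }

    /** Returns the level the current pass should log at. */
    String levelForThisPass() {
        return (passes++ % infoEvery == 0) ? "INFO" : "DEBUG";
    }
}
```

With, say, a 30-second monitor interval, {{infoEvery = 60}} would surface a stuck node at INFO roughly every half hour without any log4j changes.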
{code}
final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>> it =
    new CyclicIteration<DatanodeDescriptor, AbstractList<BlockInfo>>(
        decomNodeBlocks, iterkey).iterator();
{code}

Well, this is creative, but I think I'd rather have the standard indentation :)

> Refactor and improve decommissioning logic into DecommissionManager
> -------------------------------------------------------------------
>
>              Key: HDFS-7411
>              URL: https://issues.apache.org/jira/browse/HDFS-7411
>          Project: Hadoop HDFS
>       Issue Type: Improvement
> Affects Versions: 2.5.1
>         Reporter: Andrew Wang
>         Assignee: Andrew Wang
>      Attachments: hdfs-7411.001.patch, hdfs-7411.002.patch,
> hdfs-7411.003.patch, hdfs-7411.004.patch, hdfs-7411.005.patch,
> hdfs-7411.006.patch
>
> Would be nice to split out decommission logic from DatanodeManager to
> DecommissionManager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)