[
https://issues.apache.org/jira/browse/HDFS-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086285#comment-15086285
]
Andrew Wang commented on HDFS-3599:
-----------------------------------
I think that same check is still in DecommissionManager#isSufficient:
{code}
if (bc.isUnderConstruction() && block.equals(bc.getLastBlock())) {
  // Can decom a UC block as long as there will still be minReplicas
  if (blockManager.hasMinStorage(block, numLive)) {
    LOG.trace("UC block {} sufficiently-replicated since numLive ({}) "
        + ">= minR ({})", block, numLive,
        blockManager.getMinStorageNum(block));
    return true;
  }
}
{code}
Looking at the HDFS-7411 diff, it did not change the unit test introduced by
HDFS-5579, so I think the check was carried over correctly.
The high-level point is that open files block decommission. If you try to
decommission the 3 nodes that are writing the 3 replicas of a block, we can't
drop below minReplication and still be able to complete the block. So
decommission can proceed on at most 3 - minRep of those nodes; the rest wait
until the file is closed.
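To make the arithmetic concrete, here is a minimal standalone sketch of the rule
the snippet above implements (the class, method, and parameter names are
illustrative, not the actual NameNode code):
{code}
// Toy model of the UC-block decommission rule: a decommissioning node
// holding a replica of an under-construction block can only finish once
// enough live (non-decommissioning) replicas remain to satisfy
// minReplication, so the block can still be completed.
public class UcDecomRule {
  static boolean isSufficient(int totalReplicas,
                              int decommissioningReplicas,
                              int minReplication) {
    int numLive = totalReplicas - decommissioningReplicas;
    return numLive >= minReplication;
  }

  public static void main(String[] args) {
    // 3 replicas, minReplication = 1: decommissioning 2 nodes is fine...
    System.out.println(isSufficient(3, 2, 1)); // true
    // ...but decommissioning all 3 leaves 0 live replicas, so all three
    // nodes wait until the writer closes the file.
    System.out.println(isSufficient(3, 3, 1)); // false
  }
}
{code}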
DecommissionManager right now has tons of debug/trace prints for these kinds
of issues. It'd be good to expose this as a metric or something, so admins can
query it easily.
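For instance, a gauge through the metrics2 API would do it (a sketch only; this
source and its wiring are hypothetical, not existing NameNode code):
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableGaugeInt;

// Hypothetical metrics source so admins can see via JMX how many
// under-construction blocks are holding up decommission, instead of
// grepping trace logs.
@Metrics(name = "DecommissionMetrics", context = "dfs")
public class DecommissionMetrics {
  @Metric("UC blocks currently blocking decommission")
  MutableGaugeInt ucBlocksBlockingDecommission;

  DecommissionMetrics() {
    DefaultMetricsSystem.instance().register(
        "DecommissionMetrics", "Decommission progress", this);
  }

  // Would be called from the decommission monitor's periodic scan
  // (hypothetical hook).
  void setUcBlocksBlockingDecommission(int n) {
    ucBlocksBlockingDecommission.set(n);
  }
}
{code}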
That, or we solve it once and for all by actively re-routing clients away from
decommissioning nodes. There are a number of ideas for how we might do this.
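As a toy sketch of the simplest reader-side variant (hypothetical; the types
below are simplified stand-ins for the real ones), the NameNode could order
each block's locations so decommissioning nodes come last, steering new
clients elsewhere:
{code}
import java.util.Arrays;
import java.util.Comparator;

public class ReorderLocations {
  record DataNode(String host, boolean decommissioning) {}

  // Stable sort keeps the existing (e.g. distance-based) ordering
  // within each group while pushing decommissioning nodes to the end.
  static void deprioritizeDecommissioning(DataNode[] locations) {
    Arrays.sort(locations,
        Comparator.comparing(DataNode::decommissioning));
  }

  public static void main(String[] args) {
    DataNode[] locs = {
        new DataNode("dn1", true),
        new DataNode("dn2", false),
        new DataNode("dn3", false),
    };
    deprioritizeDecommissioning(locs);
    System.out.println(Arrays.toString(locs)); // dn2, dn3, dn1
  }
}
{code}
Writers are the harder half, since an open pipeline already pins its replicas
to specific nodes; that is what the ideas above would need to address.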
> Better expose when under-construction files are preventing DN decommission
> --------------------------------------------------------------------------
>
> Key: HDFS-3599
> URL: https://issues.apache.org/jira/browse/HDFS-3599
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, namenode
> Affects Versions: 3.0.0
> Reporter: Todd Lipcon
> Assignee: Zhe Zhang
>
> Filing on behalf of Konstantin Olchanski:
> {quote}
> I have been trying to decommission a data node, but the process
> stalled. I followed the correct instructions, observed my node
> listed in "Decommissioning Nodes", etc, observed "Under Replicated Blocks"
> decrease, etc. But the count went down to "1" and the decommission process
> stalled.
> There was no visible activity anywhere, nothing was happening (well,
> maybe in some hidden log file somewhere something complained,
> but I did not look).
> It turns out that I had some files stuck in "OPENFORWRITE" mode,
> as reported by "hdfs fsck / -openforwrite -files -blocks -locations -racks":
> {code}
> /users/trinat/data/.fuse_hidden0000177e00000002 0 bytes, 0 block(s),
> OPENFORWRITE: OK
> /users/trinat/data/.fuse_hidden0000178d00000003 0 bytes, 0 block(s),
> OPENFORWRITE: OK
> /users/trinat/data/.fuse_hidden00001da300000004 0 bytes, 1 block(s),
> OPENFORWRITE: OK
> 0. BP-88378204-142.90.119.126-1340494203431:blk_6980480609696383665_20259{blockUCState=UNDER_CONSTRUCTION,
> primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[142.90.111.72:50010|RBW],
> ReplicaUnderConstruction[142.90.119.162:50010|RBW],
> ReplicaUnderConstruction[142.90.119.126:50010|RBW]]} len=0 repl=3
> [/detfac/142.90.111.72:50010, /isac2/142.90.119.162:50010,
> /isac2/142.90.119.126:50010]
> {code}
> After I deleted those files, the decommission process completed successfully.
> Perhaps one can add some visible indication somewhere on the HDFS status web
> page that the decommission process is stalled, and maybe report why it is
> stalled?
> Maybe the number of "OPENFORWRITE" files should be listed on the status page
> next to the "Number of Under-Replicated Blocks"? (Since I know that nobody is
> writing to my HDFS, a non-zero count would give me a clue that something is
> wrong.)
> {quote}