Andrew Wang commented on HDFS-10999:

Thanks for the insight, Allen.

> M: "How long for recovery?"
> A: "No idea. The NN doesn't tell me if these are EC blocks or regular blocks
> that were lost and one is faster to recover than the other."

That's what I was getting at with the pendingReconstructionBlocksCount. If we 
fix it as I talked about above, it'd actually tell you how much work is 
remaining, and how fast that work is progressing.
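
To make the "how much work is remaining, and how fast" point concrete, here is a minimal sketch of the arithmetic an admin could do with two samples of such a counter. It assumes the counter really does track outstanding reconstruction work (i.e. after the fix discussed above). The class name, method names, and sample values are hypothetical, and how the samples are collected from the NN is deliberately left out.

```java
import java.util.OptionalDouble;

public class RecoveryEta {

    /**
     * Blocks recovered per second between two samples of a pending-work
     * counter. Positive when the backlog is shrinking.
     */
    static double drainRate(long pendingBefore, long pendingAfter, double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        return (pendingBefore - pendingAfter) / elapsedSeconds;
    }

    /**
     * Estimated seconds until the backlog is empty. Empty when the backlog
     * is not shrinking, since no meaningful estimate exists in that case.
     */
    static OptionalDouble etaSeconds(long pendingNow, double rate) {
        if (pendingNow == 0) {
            return OptionalDouble.of(0.0);
        }
        if (rate <= 0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(pendingNow / rate);
    }

    public static void main(String[] args) {
        // Hypothetical samples taken 60 seconds apart.
        long before = 12_000;
        long after = 9_000;
        double rate = drainRate(before, after, 60.0);
        OptionalDouble eta = etaSeconds(after, rate);
        System.out.println("rate (blocks/s): " + rate);
        System.out.println("eta (s): " + (eta.isPresent() ? eta.getAsDouble() : "unknown"));
    }
}
```

A linear estimate like this is only as good as the assumption that the recovery rate stays constant, which is exactly where knowing whether the lost blocks are EC or replicated would matter.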

> ...I've also used it during system recovery and migrations as a measurement of
> how many more DNs I need to bring up such that I have more sources for block

Would the "pending" queue metrics also work for this? We can also look at 
improved DN-side metrics related to replication work.

> This number represents something that I as an admin have some semblance of
> control over: I could always manually copy blocks from one node to another to
> speed things up.
> Under EC, I don't know of anything manual I can do if it is missing chunks of

I really, really hope that manually copying blocks around is not a normal part 
of operating an HDFS cluster.

The point is still valid, though: maybe we should take a harder look at the
recovery work throttles on the NN and DN, and make them dynamically
reconfigurable if they aren't already. I recall seeing some customer issues
where we temporarily bumped up these values to recover from failures more
quickly.

> Use more generic "low redundancy" blocks instead of "under replicated" blocks
> -----------------------------------------------------------------------------
>                 Key: HDFS-10999
>                 URL: https://issues.apache.org/jira/browse/HDFS-10999
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: erasure-coding
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Wei-Chiu Chuang
>            Assignee: Yuanbo Liu
>              Labels: supportability
> Per HDFS-9857, it seems that in the Hadoop 3 world, people prefer the more
> generic term "low redundancy" to the old-fashioned "under replicated". But the
> old term is still used in messages in several places, such as the web UI,
> dfsadmin and fsck. We should probably change them to avoid confusion.
> Filing this JIRA to discuss it.

