[ https://issues.apache.org/jira/browse/HDFS-13658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536996#comment-16536996 ]
Andrew Wang commented on HDFS-13658:
------------------------------------
Hi Kitti, thanks for working on this! IIUC, your patch calls
updateOneReplicaBlocks in different places in BlockManager to track this
metric. However, don't we already have this metric in LowRedundancyBlocks, via
the size of the highest priority queue? This would also be an easy way of
handling the EC case, since EC also uses the highest priority queue for
minimally durable blocks. Exposing the lengths of these different queues might
be interesting more generally, since it would give more detailed insight into
NN recovery activities. I'll also note that countNodes is a somewhat expensive
function, so it's not good to be calling it frequently in the BM.
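
To illustrate the queue-size idea, here is a minimal, self-contained sketch. It
is not the actual NN code: the class, constants, and the three-level layout are
simplified assumptions made up for this example. The point it tries to show is
that if blocks are already bucketed by priority when they are queued, the
"one replica" count (and its EC equivalent) is just the size of the highest
priority bucket, with no extra counting at other call sites.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for the NN's low-redundancy queues.
// All names here are illustrative assumptions, not Hadoop APIs.
class LowRedundancyQueuesSketch {
    static final int HIGHEST_PRIORITY = 0; // exactly the minimum needed to read the data
    static final int LOW_REDUNDANCY = 1;   // below target, but with copies to spare
    static final int UNREADABLE = 2;       // fewer live copies than the minimum
    static final int LEVELS = 3;

    private final List<Set<String>> queues = new ArrayList<>();

    LowRedundancyQueuesSketch() {
        for (int i = 0; i < LEVELS; i++) {
            queues.add(new LinkedHashSet<>());
        }
    }

    // minLive is 1 for a replicated block and the number of data blocks
    // for an EC block group (for example 6 for an RS-6-3 policy).
    // Returns -1 when the block is at or above its expected redundancy.
    static int priority(int live, int minLive, int expected) {
        if (live >= expected) {
            return -1;
        }
        if (live < minLive) {
            return UNREADABLE;
        }
        if (live == minLive) {
            return HIGHEST_PRIORITY;
        }
        return LOW_REDUNDANCY;
    }

    // Queues the block if it needs reconstruction; returns true if it was added.
    boolean add(String blockId, int live, int minLive, int expected) {
        int level = priority(live, minLive, expected);
        return level >= 0 && queues.get(level).add(blockId);
    }

    // The metric is simply the length of one queue.
    int size(int level) {
        return queues.get(level).size();
    }

    public static void main(String[] args) {
        LowRedundancyQueuesSketch q = new LowRedundancyQueuesSketch();
        q.add("blk_1", 1, 1, 3);      // replicated, one replica left
        q.add("blk_2", 2, 1, 3);      // replicated, one replica missing
        q.add("blk_3", 3, 1, 3);      // fully replicated, not queued
        q.add("ec_group_1", 6, 6, 9); // EC group with no spare blocks
        System.out.println("highest priority: " + q.size(HIGHEST_PRIORITY));
        System.out.println("low redundancy: " + q.size(LOW_REDUNDANCY));
    }
}
```

Because the replicated and EC cases differ only in the minLive argument, both
end up in the same bucket, which is why a single queue length could cover both.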
A few other comments:
* ClientProtocol#getStats is deprecated, so we shouldn't be putting new fields
there. I think getReplicatedBlockStats and getECBlockGroupStats are the correct
replacements. Similarly for the new beans: there are separate Replicated and EC
classes, so the new metrics shouldn't go into NameNodeMXBean.
* Do we need the fsck changes? fsck already shows the number of
under-replicated blocks, which is a very similar sign that the cluster is not
healthy. If an admin isn't seeing the existing fsck metric, they aren't going
to see this one either. This would save us making the protocol changes, if
we're just exposing new NN metrics.
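
As a rough illustration of the first point, the sketch below keeps the new
counter in separate typed stats objects for replicated blocks and EC block
groups rather than in a shared positional array. The classes and field names
are hypothetical and written only for this example; they are not the Hadoop
ReplicatedBlockStats or ECBlockGroupStats classes. The new counter is modelled
as optional on the assumption that an older server might not report it.

```java
import java.util.OptionalLong;

// Hypothetical typed holder for replicated-block counters (not a Hadoop class).
final class ReplicatedStatsSketch {
    private final long lowRedundancyBlocks;
    private final OptionalLong highestPriorityBlocks; // empty if not reported

    ReplicatedStatsSketch(long lowRedundancyBlocks, OptionalLong highestPriorityBlocks) {
        this.lowRedundancyBlocks = lowRedundancyBlocks;
        this.highestPriorityBlocks = highestPriorityBlocks;
    }

    long getLowRedundancyBlocks() {
        return lowRedundancyBlocks;
    }

    OptionalLong getHighestPriorityBlocks() {
        return highestPriorityBlocks;
    }
}

// Hypothetical typed holder for EC block group counters (not a Hadoop class).
final class EcGroupStatsSketch {
    private final long lowRedundancyGroups;
    private final OptionalLong highestPriorityGroups; // empty if not reported

    EcGroupStatsSketch(long lowRedundancyGroups, OptionalLong highestPriorityGroups) {
        this.lowRedundancyGroups = lowRedundancyGroups;
        this.highestPriorityGroups = highestPriorityGroups;
    }

    long getLowRedundancyGroups() {
        return lowRedundancyGroups;
    }

    OptionalLong getHighestPriorityGroups() {
        return highestPriorityGroups;
    }
}

class BlockStatsSketchDemo {
    // Builds a one-line report, including the new counters only when present.
    static String report(ReplicatedStatsSketch r, EcGroupStatsSketch e) {
        StringBuilder sb = new StringBuilder();
        sb.append("Replicated low redundancy: ").append(r.getLowRedundancyBlocks());
        r.getHighestPriorityBlocks()
            .ifPresent(n -> sb.append(", highest priority: ").append(n));
        sb.append("; EC low redundancy: ").append(e.getLowRedundancyGroups());
        e.getHighestPriorityGroups()
            .ifPresent(n -> sb.append(", highest priority: ").append(n));
        return sb.toString();
    }

    public static void main(String[] args) {
        ReplicatedStatsSketch r = new ReplicatedStatsSketch(10, OptionalLong.of(2));
        EcGroupStatsSketch e = new EcGroupStatsSketch(4, OptionalLong.empty());
        System.out.println(report(r, e));
    }
}
```

With named getters on each stats type, adding a counter does not shift the
meaning of existing positions, and the replicated and EC numbers stay separate.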
> fsck, dfsadmin -report, and NN WebUI should report number of blocks that have
> 1 replica
> ---------------------------------------------------------------------------------------
>
> Key: HDFS-13658
> URL: https://issues.apache.org/jira/browse/HDFS-13658
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.1.0
> Reporter: Kitti Nanasi
> Assignee: Kitti Nanasi
> Priority: Major
> Attachments: HDFS-13658.001.patch, HDFS-13658.002.patch,
> HDFS-13658.003.patch, HDFS-13658.004.patch, HDFS-13658.005.patch,
> HDFS-13658.006.patch, HDFS-13658.007.patch
>
>
> fsck, dfsadmin -report, and NN WebUI should report number of blocks that have
> 1 replica. We have had many cases opened in which a customer lost files/blocks
> after losing a disk or a DN, because they had blocks with only 1 replica. We
> need to make the customer better aware of this situation and that they should
> take action.