[
https://issues.apache.org/jira/browse/HDFS-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296254#comment-16296254
]
Xiao Chen commented on HDFS-11847:
----------------------------------
Thanks for working on this Manoj! This will be a nice tool for troubleshooting
decommissioning.
Some comments:
- Since HDFS-10480 is released, we cannot change the APIs unfortunately. It
seems to me we'd have to provide an overload of {{listOpenFiles}}. I like the
use of enums though, maybe we should deprecate the existing API to encourage
the new API to always be used.
- From the API, do we support specifying both {{BLOCKING_DECOMMISSION}} and
{{ALL_OPEN_FILES}} together? The implementation in {{FSN#listOpenFiles}}
doesn't look like it does, but I'm also wondering how we plan to support them
on the same {{OpenFilesIterator}}. Do we want to have types on
{{OpenFileEntry}}?
- From a usage perspective, it may also be useful to print out the DataNodes.
- {{DatanodeAdminManager#processBlocksInternal}}: maybe we can skip a block
when it and its inode are inconsistent, instead of throwing from the
preconditions? We could log in the NN to help debugging, and from hdfsadmin we
would still see the other open files.
- {{DatanodeAdminManager#processBlocksInternal}}, can we simply use
{{lowRedundancyOpenFiles.size()}} and get rid of
{{lowRedundancyBlocksInOpenFiles}}?
- {{LeavingServiceStatus}} similar to above, do we need both the counter and
the set of openfiles?
(Holding all inode id would consume more memory, but since this only happens
when decommissioning + open files, which hopefully would be a tiny portion of
all files, I think we're okay)
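To make the overload suggestion concrete, here is a minimal sketch. Every name
in it ({{OpenFilesType}}, {{OpenFilesApiSketch}}, {{OpenFile}}) is a
hypothetical stand-in and not the actual HDFS code. It also assumes one
possible answer to the "both specified" question, namely that
{{ALL_OPEN_FILES}} wins when both types are given.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

// Hypothetical filter enum, mirroring the two types discussed above.
enum OpenFilesType {
  ALL_OPEN_FILES,
  BLOCKING_DECOMMISSION
}

// Hypothetical, simplified stand-in for the admin API; not the real class.
class OpenFilesApiSketch {

  // Hypothetical entry: a path plus a flag saying whether the open file
  // is currently blocking a DataNode decommission.
  static final class OpenFile {
    final String path;
    final boolean blockingDecommission;

    OpenFile(String path, boolean blockingDecommission) {
      this.path = path;
      this.blockingDecommission = blockingDecommission;
    }
  }

  private final List<OpenFile> openFiles;

  OpenFilesApiSketch(List<OpenFile> openFiles) {
    this.openFiles = openFiles;
  }

  // Already-released signature: kept for compatibility, deprecated, and
  // delegating to the new overload with the old "list everything" behavior.
  @Deprecated
  List<String> listOpenFiles() {
    return listOpenFiles(EnumSet.of(OpenFilesType.ALL_OPEN_FILES));
  }

  // New overload taking the set of requested types.
  // Assumption: ALL_OPEN_FILES takes precedence if both types are passed.
  List<String> listOpenFiles(EnumSet<OpenFilesType> types) {
    List<String> result = new ArrayList<>();
    for (OpenFile f : openFiles) {
      boolean wanted = types.contains(OpenFilesType.ALL_OPEN_FILES)
          || (types.contains(OpenFilesType.BLOCKING_DECOMMISSION)
              && f.blockingDecommission);
      if (wanted) {
        result.add(f.path);
      }
    }
    return result;
  }
}
```

Existing callers of the no-arg method keep working unchanged, while the
deprecation warning nudges them toward the enum-based overload.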
Nits:
- {{LeavingServiceStatus}} trivial and pre-existing: comment at the end of this
class should say {{End of class LeavingServiceStatus}}, not
{{DecommissioningStatus}}
- {{FSN#getFilesBlockingDecom}}: suggest adding {{assert hasReadLock();}} to
safeguard future changes
- {{TestDecommission#verifyOpenFilesBlockingDecommission}}: Should save the
previous {{System.out}} as a local var, and set back when we're done.
{{System.setOut(System.out);}} won't restore to the old out. Also the restore
logic should be in a finally block.
- {{TestDecommission}}, can we set
{{DFS_NAMENODE_REDUNDANCY_INTERVAL_SECONDS_KEY}} to {{Integer.MAX_VALUE}}?
A 1-second interval may not be robust enough.
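For the {{System.out}} nit, the pattern I have in mind looks roughly like the
sketch below. {{StdoutCaptureSketch}} and {{captureStdout}} are hypothetical
names used only for illustration; the test would inline the same
save / try / finally structure around the dfsadmin call.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Hypothetical helper showing the save-and-restore pattern for System.out.
class StdoutCaptureSketch {

  // Runs the action with System.out redirected to a buffer and returns
  // whatever the action printed.
  static String captureStdout(Runnable action) {
    // Save the real stream first; System.setOut(System.out) would be a no-op.
    PrintStream oldOut = System.out;
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    PrintStream capture = new PrintStream(buffer, true);
    System.setOut(capture);
    try {
      action.run();
    } finally {
      // Restore in finally so a failing action cannot leak the redirect
      // into later tests.
      capture.flush();
      System.setOut(oldOut);
    }
    return buffer.toString();
  }
}
```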
> Enhance dfsadmin listOpenFiles command to list files blocking datanode
> decommissioning
> --------------------------------------------------------------------------------------
>
> Key: HDFS-11847
> URL: https://issues.apache.org/jira/browse/HDFS-11847
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.0.0-alpha1
> Reporter: Manoj Govindassamy
> Assignee: Manoj Govindassamy
> Attachments: HDFS-11847.01.patch, HDFS-11847.02.patch
>
>
> HDFS-10480 adds the {{listOpenFiles}} option to the {{dfsadmin}} command to
> list all the open files in the system.
> Additionally, it would be very useful to only list open files that are
> blocking the DataNode decommissioning. With thousand+ node clusters, where
> there might be machines added and removed regularly for maintenance, any
> option to monitor and debug decommissioning status is very helpful. Proposal
> here is to add suboptions to {{listOpenFiles}} for the above case.