[ 
https://issues.apache.org/jira/browse/HDFS-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296254#comment-16296254
 ] 

Xiao Chen commented on HDFS-11847:
----------------------------------

Thanks for working on this Manoj! This will be a nice tool for troubleshooting 
decommissioning.

Some comments:
- Since HDFS-10480 has already been released, we unfortunately cannot change 
the existing APIs. It seems to me we'd have to provide an overload of 
{{listOpenFiles}}. I like the use of enums though; maybe we should deprecate 
the existing API to encourage use of the new one.
- From the API perspective, do we support specifying both 
{{BLOCKING_DECOMMISSION}} and {{ALL_OPEN_FILES}}? The implementation in 
{{FSN#listOpenFiles}} doesn't appear to, but I'm also wondering how we plan to 
support both on the same {{OpenFilesIterator}}. Do we want to have types on 
{{OpenFileEntry}}?
- From a usage perspective, it may also be useful to print out the DataNodes.
- {{DatanodeAdminManager#processBlocksInternal}}: if a block and its inode are 
inconsistent, maybe we can skip it instead of throwing from preconditions? We 
could log in the NN to help debugging, and from hdfsadmin we would still see 
the other open files.
- {{DatanodeAdminManager#processBlocksInternal}}, can we simply use 
{{lowRedundancyOpenFiles.size()}} and get rid of 
{{lowRedundancyBlocksInOpenFiles}}?
- {{LeavingServiceStatus}}: similar to the above, do we need both the counter 
and the set of open files?
(Holding all the inode IDs would consume more memory, but since this only 
applies to files that are open on decommissioning nodes, which hopefully is a 
tiny portion of all files, I think we're okay.)
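
To illustrate the overload-and-deprecate idea from the first comment, here is a minimal, self-contained sketch. It is not the real HDFS code: the class {{OpenFilesApiSketch}}, the {{Entry}} record, the enum type name {{OpenFilesType}} and the no-argument signature of the old method are all assumptions made for illustration. Only the two constant names come from the patch. The rule that {{ALL_OPEN_FILES}} subsumes the other type when both are given is also just one possible answer to the question above.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

/** Illustrative only: none of these types are the real HDFS classes. */
public class OpenFilesApiSketch {

  /** Hypothetical filter enum; only the two constant names come from the patch. */
  enum OpenFilesType { ALL_OPEN_FILES, BLOCKING_DECOMMISSION }

  /** Hypothetical stand-in for an open-file record. */
  static final class Entry {
    final String path;
    final boolean blockingDecommission;

    Entry(String path, boolean blockingDecommission) {
      this.path = path;
      this.blockingDecommission = blockingDecommission;
    }
  }

  private final List<Entry> openFiles = new ArrayList<>();

  void add(String path, boolean blockingDecommission) {
    openFiles.add(new Entry(path, blockingDecommission));
  }

  /** Already-released signature: left unchanged, deprecated, and delegating. */
  @Deprecated
  List<String> listOpenFiles() {
    return listOpenFiles(EnumSet.of(OpenFilesType.ALL_OPEN_FILES));
  }

  /** New overload taking the set of requested types. */
  List<String> listOpenFiles(EnumSet<OpenFilesType> types) {
    List<String> paths = new ArrayList<>();
    for (Entry e : openFiles) {
      // Assumption: ALL_OPEN_FILES subsumes every other type when both are set.
      boolean wanted = types.contains(OpenFilesType.ALL_OPEN_FILES)
          || (types.contains(OpenFilesType.BLOCKING_DECOMMISSION)
              && e.blockingDecommission);
      if (wanted) {
        paths.add(e.path);
      }
    }
    return paths;
  }

  public static void main(String[] args) {
    OpenFilesApiSketch sketch = new OpenFilesApiSketch();
    sketch.add("/user/a/file1", false);
    sketch.add("/user/b/file2", true);
    // Old callers keep working and still see every open file.
    System.out.println(sketch.listOpenFiles());
    // New callers can narrow the listing.
    System.out.println(
        sketch.listOpenFiles(EnumSet.of(OpenFilesType.BLOCKING_DECOMMISSION)));
  }
}
```

Existing callers compile unchanged against the deprecated method, while the compiler warning nudges them toward the {{EnumSet}} overload.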

Nits:
- {{LeavingServiceStatus}}: trivial and pre-existing, but the comment at the 
end of this class should say {{End of class LeavingServiceStatus}}, not 
{{DecommissioningStatus}}.
- {{FSN#getFilesBlockingDecom}}: suggest adding {{assert hasReadLock();}} to 
safeguard against future changes.
- {{TestDecommission#verifyOpenFilesBlockingDecommission}}: should save the 
previous {{System.out}} in a local variable and set it back when we're done. 
{{System.setOut(System.out);}} won't restore the old out. Also, the restore 
logic should be in a finally block.
- {{TestDecommission}}: can we set 
{{DFS_NAMENODE_REDUNDANCY_INTERVAL_SECONDS_KEY}} to {{Integer.MAX_VALUE}}? 
A 1-second interval may not be robust enough.
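
For the {{System.out}} nit, the save-and-restore pattern could look roughly like the sketch below. The class and method names ({{StdoutCaptureSketch}}, {{captureStdout}}) are hypothetical and not from the patch; the point is only that the original stream is held in a local variable and restored in a finally block.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

/** Illustrative helper; not part of TestDecommission. */
public class StdoutCaptureSketch {

  /** Runs the action with System.out redirected and returns what it printed. */
  static String captureStdout(Runnable action) {
    // Save the real stream first; System.setOut(System.out) would be a no-op.
    PrintStream oldOut = System.out;
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    PrintStream capture = new PrintStream(buffer, true);
    System.setOut(capture);
    try {
      action.run();
    } finally {
      // Restore even if the action throws, so later tests print normally.
      capture.flush();
      System.setOut(oldOut);
    }
    return buffer.toString();
  }

  public static void main(String[] args) {
    String printed = captureStdout(() -> System.out.print("open files listed"));
    System.out.println("captured: " + printed);
  }
}
```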

> Enhance dfsadmin listOpenFiles command to list files blocking datanode 
> decommissioning
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-11847
>                 URL: https://issues.apache.org/jira/browse/HDFS-11847
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Manoj Govindassamy
>            Assignee: Manoj Govindassamy
>         Attachments: HDFS-11847.01.patch, HDFS-11847.02.patch
>
>
> HDFS-10480 adds the {{listOpenFiles}} option to the {{dfsadmin}} command to 
> list all the open files in the system.
> Additionally, it would be very useful to only list open files that are 
> blocking the DataNode decommissioning. With thousand+ node clusters, where 
> there might be machines added and removed regularly for maintenance, any 
> option to monitor and debug decommissioning status is very helpful. Proposal 
> here is to add suboptions to {{listOpenFiles}} for the above case.


