[ http://issues.apache.org/jira/browse/HADOOP-432?page=comments#action_12458970 ]

Yoram Arnon commented on HADOOP-432:
------------------------------------
* I just ran 'time hadoop dfs -lsr / > /dev/null' on our dfs from a client; it took 3:38 minutes of real time and 1:40 minutes of user+system time on the client, consuming 20-30% cpu on both the client and the namenode. Depending on the number of files deleted, doing this every few minutes is expensive. For comparison, 'time hadoop dfs -du /' takes 2 seconds (<1 second user+system), so delivering the paths from the namenode to the client is the expensive part, and the internal implementation in the namenode is cheap. I repeated this locally on the namenode, where dfs -lsr took 3:45/0:45 minutes (real/user+system), so the network is not all to blame.
* Data in the trash is arranged in its original folder layout, to enable a person to locate her files and restore them. Creating a folder for every X minutes (how many?) will make restoring a file harder.
* An external process reclaiming space needs to be monitored, otherwise files will accumulate in the trash and the dfs will fill up. This could be achieved by a cron job, but then the admin is required to do one extra step to set up dfs, or the namenode could fork off a clean-up process.

I'm 80% for the performance and simplicity of an internal thread, 20% for the safety of an external cleanup.
What do others think?

> support undelete, snapshots, or other mechanism to recover lost files
> ---------------------------------------------------------------------
>
>          Key: HADOOP-432
>          URL: http://issues.apache.org/jira/browse/HADOOP-432
>      Project: Hadoop
>   Issue Type: Improvement
>   Components: dfs
>     Reporter: Yoram Arnon
>  Assigned To: Wendy Chien
>  Attachments: undelete12.patch, undelete16.patch, undelete17.patch
>
>
> currently, once you delete a file it's gone forever.
> most file systems allow some form of recovery of deleted files.
> a simple solution would be an 'undelete' command.
> a more comprehensive solution would include snapshots, manual and automatic,
> with scheduling options.

--
This message is automatically generated by JIRA.
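
To make the "internal thread" option from the comment above concrete, here is a minimal sketch of age-based trash cleanup that leaves the original folder layout in place. It is only an illustration under stated assumptions: it runs against a local directory rather than dfs, it uses a file's modification time as a stand-in for the time it was deleted, and the names expunge_trash and start_cleaner are hypothetical, not part of Hadoop or of any attached patch.

```python
import os
import threading
import time


def expunge_trash(trash_root, max_age_seconds, now=None):
    """Remove files under trash_root older than max_age_seconds.

    Assumption: modification time stands in for deletion time.
    Returns the list of removed file paths.
    """
    now = time.time() if now is None else now
    removed = []
    # Walk bottom-up so a directory is visited after its children,
    # which lets us prune directories that have just become empty.
    for dirpath, _dirnames, filenames in os.walk(trash_root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > max_age_seconds:
                os.remove(path)
                removed.append(path)
        # Keep the trash root itself; drop only emptied subdirectories.
        if dirpath != trash_root and not os.listdir(dirpath):
            os.rmdir(dirpath)
    return removed


def start_cleaner(trash_root, max_age_seconds, interval_seconds):
    """Run expunge_trash periodically in a daemon thread.

    Returns (stop_event, thread); set the event to stop the loop.
    """
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_seconds):
            expunge_trash(trash_root, max_age_seconds)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return stop, thread
```

Because files keep their original relative paths under the trash root, restoring one is a matter of moving it back, with no per-interval folders to search. The external alternative would be a cron job calling expunge_trash on a schedule instead of start_cleaner, at the cost of the extra setup and monitoring step mentioned above.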