[ http://issues.apache.org/jira/browse/HADOOP-432?page=comments#action_12458970 ]
            
Yoram Arnon commented on HADOOP-432:
------------------------------------

* I just ran 'time hadoop dfs -lsr / > /dev/null' on our dfs from a client; 
it took 3:38 minutes real time and 1:40 minutes user+system time on the 
client, consuming 20-30% cpu on both client and namenode. Depending on the 
number of files deleted, doing this every few minutes is expensive. For 
comparison, 'time hadoop dfs -du /' takes 2 seconds (<1 second user+system), 
so the expensive part is delivering the paths from the namenode to the 
client, while the internal work in the namenode is cheap.
I repeated this locally on the namenode, where dfs -lsr took 3:45 minutes 
real time and 0:45 minutes user+system time, so the network is not all to 
blame.

* data in the trash is arranged in its original folder layout, to enable a 
person to locate her files and restore them. Creating a folder for every X 
minutes (how many?) will make restoring a file harder.

* an external process reclaiming space needs to be monitored, otherwise files 
will accumulate in the trash and the dfs will fill up. This could be achieved 
by a cron job, but then the admin is required to do one extra step to set up 
dfs, or the namenode could fork off a clean-up process.

I'm 80% for the performance and simplicity of an internal thread, 20% for the 
safety of an external cleanup.
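
To make the internal-thread option concrete, here is a minimal sketch in Java. 
It is not the code in the attached patches: the class TrashCleaner, its 
methods, the in-memory map standing in for namenode state, and the retention 
and period parameters are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of an in-process trash cleaner; not the patch code. */
public class TrashCleaner {
    // Trash path -> deletion time in ms. Stands in for state the namenode
    // already holds, so no path listing has to cross the wire.
    private final Map<String, Long> deletedAt = new ConcurrentHashMap<>();
    private final long retentionMs;

    public TrashCleaner(long retentionMs) {
        this.retentionMs = retentionMs;
    }

    /** Records that a path was moved to the trash at the given time. */
    public void recordDelete(String path, long nowMs) {
        deletedAt.put(path, nowMs);
    }

    /** Removes and returns every entry at least retentionMs old. */
    public List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : deletedAt.entrySet()) {
            if (nowMs - e.getValue() >= retentionMs) {
                expired.add(e.getKey());
            }
        }
        for (String path : expired) {
            deletedAt.remove(path);
        }
        return expired;
    }

    /** Starts a daemon thread that calls expire() every periodMs. */
    public ScheduledExecutorService start(long periodMs) {
        ScheduledExecutorService ses =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "trash-cleaner");
                t.setDaemon(true); // must not keep the process alive
                return t;
            });
        ses.scheduleAtFixedRate(
            () -> expire(System.currentTimeMillis()),
            periodMs, periodMs, TimeUnit.MILLISECONDS);
        return ses;
    }
}
```

Because the expiry check runs over state held in the same process, it avoids 
the path-delivery cost measured above. The trade-off is the one already 
stated: if this thread dies, nothing outside the process notices.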

What do others think?


> support undelete, snapshots, or other mechanism to recover lost files
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-432
>                 URL: http://issues.apache.org/jira/browse/HADOOP-432
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Yoram Arnon
>         Assigned To: Wendy Chien
>         Attachments: undelete12.patch, undelete16.patch, undelete17.patch
>
>
> currently, once you delete a file it's gone forever.
> most file systems allow some form of recovery of deleted files.
> a simple solution would be an 'undelete' command.
> a more comprehensive solution would include snapshots, manual and automatic, 
> with scheduling options.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
