[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026772#comment-14026772 ]

Colin Patrick McCabe commented on HDFS-6382:
--------------------------------------------

bq. You mean that we scan the whole namespace first and then split it into 5 
pieces according to the hash of the path; why not just complete the work during 
the first scanning pass? If I misunderstand your meaning, please point it out.

You need to make one RPC for each file or directory you delete.  In contrast, 
when listing a directory you make only one RPC for every {{dfs.ls.limit}} 
elements (by default 1000).  So if you have 5 workers all listing all 
directories, but only calling delete on some of the files, you still might come 
out ahead in terms of number of RPCs, provided you had a high ratio of files to 
directories.
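To make the RPC arithmetic concrete, here is a rough back-of-the-envelope sketch. Only the {{dfs.ls.limit}} default of 1000 comes from the comment above; the workload numbers are made up for illustration:

```python
# Rough RPC-count estimate for "every worker lists everything, but each
# deletes only its share of the expired files". Numbers are hypothetical.

DFS_LS_LIMIT = 1000   # default entries returned per listing RPC (dfs.ls.limit)

def estimated_rpcs(num_files, num_dirs, expired_fraction, num_workers):
    # Every worker lists every directory: one RPC per DFS_LS_LIMIT entries.
    entries = num_files + num_dirs
    listing_rpcs = num_workers * max(1, -(-entries // DFS_LS_LIMIT))  # ceil div
    # Each expired file costs exactly one delete RPC, issued once in total.
    delete_rpcs = int(num_files * expired_fraction)
    return listing_rpcs + delete_rpcs

# With a high file-to-directory ratio, the redundant listing by all 5
# workers is cheap next to the per-file delete RPCs.
print(estimated_rpcs(num_files=1_000_000, num_dirs=10_000,
                     expired_fraction=0.1, num_workers=5))  # → 105050
```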

There are other ways to partition the namespace which are smarter, but rely on 
some knowledge of what is in it, which you'd have to keep track of.
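The hash-of-path partitioning from the quoted question could be as simple as the following sketch (the 5-worker split is from the quote; the function name and hash choice are illustrative):

```python
import hashlib

def worker_for(path, num_workers=5):
    # Stable assignment: the same path always maps to the same worker,
    # no matter which machine computes the hash.
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_workers

for p in ["/logs/2014/05/app.log", "/logs/2014/06/app.log", "/backup/db.tar"]:
    print(p, "->", worker_for(p))
```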

A single-node design will work for now, though, considering that you probably 
want rate-limiting anyway.
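For that rate-limiting, a token bucket over delete RPCs is one option; this is a sketch with invented parameter names, not anything from the design doc:

```python
import time

class TokenBucket:
    """Limit delete RPCs to `rate` per second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucket(rate=100, capacity=10)
# The deleter would call limiter.try_acquire() before each delete RPC
# and back off when it returns False.
```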

bq. For the simplicity purpose, in the initial version, we will use logs to 
record which file/directory is deleted by TTL, and errors during the deleting 
process.

Even if it's not implemented at first, we should think about the configuration 
required here.  I think we want the ability to email the admins when things go 
wrong.  Possibly the notifier could be pluggable or have several policies.  
There was nothing in the doc about configuration in general, which I think we 
need to fix.  For example, how is rate limiting configurable?  How do we notify 
admins that the rate is too slow to finish in the time given?
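A pluggable notifier with several policies might look roughly like this; the class names and the config key mentioned in the comment are invented for illustration:

```python
class Notifier:
    """Base policy interface the TTL daemon would call on events."""
    def notify(self, event, message):
        raise NotImplementedError

class LogNotifier(Notifier):
    """Default policy: just record TTL deletions and errors."""
    def __init__(self):
        self.records = []
    def notify(self, event, message):
        self.records.append((event, message))

class EmailNotifier(Notifier):
    """Policy for 'email the admins when things go wrong' (stub)."""
    def __init__(self, admins):
        self.admins = admins
    def notify(self, event, message):
        if event == "error":
            pass  # would send mail to self.admins here

# The daemon would pick an implementation from configuration, e.g. a
# hypothetical key like dfs.ttl.notifier.class.
n = LogNotifier()
n.notify("deleted", "/logs/2014/05/app.log")
```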

bq. It doesn't need to be an administrator command; users can only setTtl on 
files/directories that they have write permission on, and can only getTtl on 
files/directories that they have read permission on.

You can't delete a file in HDFS unless you have write permission on the 
containing directory.  Whether you have write permission on the file itself is 
not relevant.  So I would expect the same semantics here (probably enforced by 
setfacl itself).
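The rule described here, that delete permission hangs off the containing directory rather than the file, can be modeled as a toy check (this is an illustrative stand-in, not real HDFS ACL code):

```python
def can_delete(path, writable_dirs):
    """HDFS-style rule: deleting a file requires write permission on its
    containing directory; permission on the file itself is irrelevant.
    `writable_dirs` is a toy set of directories the user may write to."""
    parent = path.rsplit("/", 1)[0] or "/"
    return parent in writable_dirs

perms = {"/logs"}
print(can_delete("/logs/app.log", perms))   # True: parent is writable
print(can_delete("/data/db.tar", perms))    # False: parent is not writable
```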

> HDFS File/Directory TTL
> -----------------------
>
>                 Key: HDFS-6382
>                 URL: https://issues.apache.org/jira/browse/HDFS-6382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, namenode
>    Affects Versions: 2.4.0
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>         Attachments: HDFS-TTL-Design.pdf
>
>
> In production environments, we often have scenarios like this: we want to 
> back up files on HDFS for some time and then delete these files 
> automatically. For example, we keep only 1 day's logs on local disk due to 
> limited disk space, but we need to keep about 1 month's logs in order to 
> debug program bugs, so we keep all the logs on HDFS and delete logs that are 
> older than 1 month. This is a typical HDFS TTL scenario, so we propose that 
> HDFS support TTL.
> Following are some details of this proposal:
> 1. HDFS can support TTL on a specified file or directory
> 2. If a TTL is set on a file, the file will be deleted automatically after 
> the TTL expires
> 3. If a TTL is set on a directory, its child files and directories will be 
> deleted automatically after the TTL expires
> 4. The child file/directory's TTL configuration should override its parent 
> directory's
> 5. A global configuration is needed to specify whether the deleted 
> files/directories should go to the trash or not
> 6. A global configuration is needed to specify whether a directory 
> with a TTL should be deleted once the TTL mechanism has emptied it.
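The expiry rule in points 1-4 of the description could be modeled roughly like this sketch, where the effective TTL is the nearest explicitly set value up the path (so a child's setting overrides its parent's); all names and numbers are illustrative:

```python
import time

def effective_ttl(path, ttls):
    """Walk up from the path; the nearest explicitly set TTL wins,
    so a child's TTL overrides its parent directory's (point 4)."""
    while True:
        if path in ttls:
            return ttls[path]
        if path in ("", "/"):
            return None  # no TTL anywhere on the path
        path = path.rsplit("/", 1)[0] or "/"

def is_expired(path, mtime, now, ttls):
    ttl = effective_ttl(path, ttls)
    return ttl is not None and now - mtime > ttl

DAY = 86400
ttls = {"/logs": 30 * DAY, "/logs/keep": 365 * DAY}  # child overrides parent
now = time.time()
print(is_expired("/logs/old.log", now - 31 * DAY, now, ttls))    # True
print(is_expired("/logs/keep/a.log", now - 31 * DAY, now, ttls)) # False
```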



--
This message was sent by Atlassian JIRA
(v6.2#6252)
