[
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026772#comment-14026772
]
Colin Patrick McCabe commented on HDFS-6382:
--------------------------------------------
bq. You mean that we scan the whole namespace first and then split it into 5
pieces according to the hash of the path? Why don't we just complete the work
during the first scan? If I have misunderstood your meaning, please point it
out.
You need to make one RPC for each file or directory you delete. In contrast,
when listing a directory you make only one RPC for every {{dfs.ls.limit}}
elements (1000 by default). So if you have 5 workers all listing all
directories, but only calling delete on some of the files, you could still come
out ahead in terms of the number of RPCs, provided you have a high ratio of
files to directories.
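To make that arithmetic concrete, here is a rough sketch (my illustration, not
from the design doc) of 5 workers that each list the whole namespace through
the normal {{FileSystem}} API but only pay the per-file delete RPC for paths
whose hash they own:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TtlWorker {
  private final FileSystem fs;
  private final int workerId;
  private final int numWorkers; // e.g. 5

  TtlWorker(FileSystem fs, int workerId, int numWorkers) {
    this.fs = fs;
    this.workerId = workerId;
    this.numWorkers = numWorkers;
  }

  /** Every worker pays the (batched) listing cost, but only the owner
   *  of a path pays the one-RPC-per-file delete cost. */
  void scan(Path dir, long ttlMillis) throws IOException {
    // Listing is batched: one RPC per dfs.ls.limit entries (1000 by
    // default), so listing is ~1000x cheaper per entry than deleting.
    for (FileStatus st : fs.listStatus(dir)) {
      if (st.isDirectory()) {
        scan(st.getPath(), ttlMillis);
      } else if (owns(st.getPath()) && expired(st, ttlMillis)) {
        fs.delete(st.getPath(), false); // one RPC per deleted file
      }
    }
  }

  private boolean owns(Path p) {
    return Math.floorMod(p.toString().hashCode(), numWorkers) == workerId;
  }

  private boolean expired(FileStatus st, long ttlMillis) {
    return System.currentTimeMillis() - st.getModificationTime() > ttlMillis;
  }
}
{code}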
There are smarter ways to partition the namespace, but they rely on some
knowledge of what is in it, which you'd have to keep track of.
A single-node design will work for now, though, considering that you probably
want rate limiting anyway.
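Rate limiting the deletes can be as simple as a token bucket in front of the
delete RPC. A minimal sketch using Guava's {{RateLimiter}}, which is already
on Hadoop's classpath; the permits-per-second value is assumed to come from a
new, yet-unnamed config key:

{code:java}
import java.io.IOException;
import com.google.common.util.concurrent.RateLimiter;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ThrottledDeleter {
  private final RateLimiter limiter;

  /** deletesPerSecond would come from a new, yet-unnamed config key. */
  ThrottledDeleter(double deletesPerSecond) {
    this.limiter = RateLimiter.create(deletesPerSecond);
  }

  boolean delete(FileSystem fs, Path p) throws IOException {
    limiter.acquire();          // blocks until a permit is available
    return fs.delete(p, false); // so we never exceed the configured rate
  }
}
{code}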
bq. For simplicity, in the initial version we will use logs to record which
files/directories are deleted by TTL, and any errors during the deletion
process.
Even if it's not implemented at first, we should think about the configuration
required here. I think we want the ability to email the admins when things go
wrong. Possibly the notifier could be pluggable or have several policies.
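To make the shape concrete, the plugin point could be as small as the
following sketch (all names here are invented, not from the design doc):

{code:java}
/** Hypothetical plugin point; an implementation would be loaded
 *  reflectively from a config key such as dfs.ttl.notifier.class
 *  (key name invented here). */
public interface TtlNotifier {
  /** Called when a TTL deletion fails or the scan falls behind. */
  void notify(String subject, String body);
}

/** One policy: email the admins (mail plumbing elided). */
class EmailTtlNotifier implements TtlNotifier {
  @Override
  public void notify(String subject, String body) {
    // build and send a message via javax.mail, addresses from config
  }
}

/** Another policy: just log and move on. */
class LogTtlNotifier implements TtlNotifier {
  @Override
  public void notify(String subject, String body) {
    org.slf4j.LoggerFactory.getLogger("TtlNotifier").warn("{}: {}", subject, body);
  }
}
{code}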
There was nothing in the doc about configuration in general, which I think we
need to fix. For example, how is rate limiting configurable? How do we notify
admins that the rate is too slow to finish in the time given?
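Detecting the "rate too slow" case is simple arithmetic once the rate and the
deadline are both configurable. A sketch, reusing the hypothetical
{{TtlNotifier}} above:

{code:java}
/** Warn if, at the configured delete rate, the remaining expired files
 *  cannot be removed before the scan deadline (all names hypothetical). */
class ProgressChecker {
  static void checkProgress(long remainingFiles, double deletesPerSecond,
                            long millisUntilDeadline, TtlNotifier notifier) {
    double secondsNeeded = remainingFiles / deletesPerSecond;
    if (secondsNeeded * 1000 > millisUntilDeadline) {
      notifier.notify("TTL scan falling behind",
          String.format("need %.0f s, only %d s left",
              secondsNeeded, millisUntilDeadline / 1000));
    }
  }
}
{code}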
bq. It doesn't need to be an administrator command; users can only setTtl on
files/directories that they have write permission on, and can only getTtl on
files/directories that they have read permission on.
You can't delete a file in HDFS unless you have write permission on the
containing directory. Whether you have write permission on the file itself is
not relevant. So I would expect the same semantics here (probably enforced by
setfacl itself).
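In other words, before honoring a TTL delete on a user's behalf, the daemon
should check write permission on the parent directory, not on the file. A
sketch using {{FileSystem#access}} (which only exists in newer releases; a
2.4-era version would have to compare the {{FsPermission}} bits by hand):

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.security.AccessControlException;

class TtlPermissionCheck {
  /** Mirror delete semantics: write permission on the PARENT directory,
   *  not on the file itself, is what authorizes a TTL delete. */
  static boolean mayTtlDelete(FileSystem fs, Path file) throws IOException {
    try {
      fs.access(file.getParent(), FsAction.WRITE); // throws on denial
      return true;
    } catch (AccessControlException denied) {
      return false;
    }
  }
}
{code}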
> HDFS File/Directory TTL
> -----------------------
>
> Key: HDFS-6382
> URL: https://issues.apache.org/jira/browse/HDFS-6382
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client, namenode
> Affects Versions: 2.4.0
> Reporter: Zesheng Wu
> Assignee: Zesheng Wu
> Attachments: HDFS-TTL-Design.pdf
>
>
> In production environments we often have a scenario like this: we want to
> back up files on HDFS for some period of time and then have them deleted
> automatically. For example, we keep only 1 day's logs on local disk due to
> limited disk space, but we need to keep about 1 month's logs in order to
> debug program bugs, so we keep all the logs on HDFS and delete logs that are
> older than 1 month. This is a typical scenario for HDFS TTL, so here we
> propose that HDFS support TTL.
> Following are some details of this proposal (a rough API sketch follows the
> list):
> 1. HDFS can support TTL on a specified file or directory.
> 2. If a TTL is set on a file, the file will be deleted automatically after
> the TTL expires.
> 3. If a TTL is set on a directory, its child files and directories will be
> deleted automatically after the TTL expires.
> 4. A child file/directory's TTL configuration should override its parent
> directory's.
> 5. A global configuration is needed to control whether deleted
> files/directories go to the trash or not.
> 6. A global configuration is needed to control whether a directory with a
> TTL should itself be deleted once the TTL mechanism has emptied it.
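A minimal sketch of what the user-facing setTtl/getTtl calls might look like,
assuming they were built on HDFS extended attributes; the {{user.ttl}} xattr
name and the {{TtlClient}} class are invented here, and the design doc may
choose a different mechanism entirely:

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TtlClient {
  /** Invented xattr name; not from the design doc. */
  private static final String TTL_XATTR = "user.ttl";

  /** Like delete, this should require write permission on the parent. */
  public static void setTtl(FileSystem fs, Path p, long ttlMillis)
      throws IOException {
    fs.setXAttr(p, TTL_XATTR,
        Long.toString(ttlMillis).getBytes(StandardCharsets.UTF_8));
  }

  /** Requires read permission; returns -1 when no TTL is set. */
  public static long getTtl(FileSystem fs, Path p) throws IOException {
    byte[] raw = fs.getXAttrs(p).get(TTL_XATTR);
    return raw == null
        ? -1
        : Long.parseLong(new String(raw, StandardCharsets.UTF_8));
  }
}
{code}

The TTL scanner would then treat a file as expired once its modification time
plus the resolved TTL (the file's own, or the nearest ancestor's per item 4)
is in the past.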