[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010632#comment-14010632 ]
Colin Patrick McCabe commented on HDFS-6382:
--------------------------------------------
bq. Why do you think that putting the cleanup mechanism into the NameNode seems questionable, can you point out some details?
Andrew and Chris commented on this earlier. See:
https://issues.apache.org/jira/browse/HDFS-6382?focusedCommentId=13998933&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13998933
I would add to that:
* Every user of this is going to want a slightly different deletion policy.
That's just way too much configuration for the NameNode to reasonably handle,
and much easier to do in a user process. For example, maybe you want to keep
at least 100 GB of logs, 100 GB of "foo" data, and 1000 GB of "bar" data. It's
easy to handle this complexity in a user process, and incredibly complex and
frustrating to handle it in the NameNode (see the sketch after this list).
* Your nightly MR job (or whatever) also needs to be able to do things like
email sysadmins when the disks are filling up, which the NameNode can't
reasonably be expected to do.
* I don't see a big advantage to doing this in the NameNode, and I see a lot
of disadvantages (more complexity to maintain, difficult configuration, and
the need to restart the NameNode to update the config).
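To make the first point concrete, here is a minimal sketch of what a
user-side cleaner could look like, using only the public FileSystem API. The
class name, the /logs path, and the 30-day cutoff are illustrative
assumptions, not anything defined by this JIRA:
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of a user-side TTL cleaner. The /logs path and the
// 30-day cutoff are illustrative assumptions; a real job would make the
// per-directory policy configurable.
public class TtlCleaner {
  public static void main(String[] args) throws IOException {
    final long ttlMillis = 30L * 24 * 60 * 60 * 1000; // keep ~1 month
    final long cutoff = System.currentTimeMillis() - ttlMillis;

    FileSystem fs = FileSystem.get(new Configuration());
    // Delete any direct child of /logs whose modification time is older
    // than the cutoff. The delete is recursive, so expired directories
    // are removed along with their contents.
    for (FileStatus stat : fs.listStatus(new Path("/logs"))) {
      if (stat.getModificationTime() < cutoff) {
        fs.delete(stat.getPath(), true);
        System.out.println("deleted " + stat.getPath());
      }
    }
    fs.close();
  }
}
{code}
Scheduling that from cron (or as a nightly Oozie/MR job) also gives you a
natural place to add alerting, per the second point.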
Maybe I could be convinced otherwise, but so far the only argument I've seen
for doing it in the NN is that it would be reusable, and that applies just as
easily to an implementation outside the NN. For example, as I pointed out
earlier, DistCp is reusable without being in the NameNode.
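The TTL metadata itself could also live outside the NameNode. Sketched purely
as an assumption (the "user.ttl" attribute name and the millisecond encoding
are hypothetical conventions, not anything this JIRA defines, and extended
attributes require a release with HDFS xattr support): a client tags a path
with an xattr, and the external cleaner reads it back to decide what to
delete.
{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical convention: store a per-path TTL in a user xattr that an
// external cleanup job consults. Neither the "user.ttl" name nor the
// millisecond encoding is defined by this JIRA.
public class TtlTagger {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/backup/logs");
    long ttlMillis = 30L * 24 * 60 * 60 * 1000;
    fs.setXAttr(dir, "user.ttl",
        Long.toString(ttlMillis).getBytes(StandardCharsets.UTF_8));

    // The cleaner reads the TTL back and compares it against each
    // child's modification time, as in the sketch above.
    byte[] raw = fs.getXAttr(dir, "user.ttl");
    long ttl = Long.parseLong(new String(raw, StandardCharsets.UTF_8));
    System.out.println("TTL on " + dir + " is " + ttl + " ms");
    fs.close();
  }
}
{code}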
> HDFS File/Directory TTL
> -----------------------
>
> Key: HDFS-6382
> URL: https://issues.apache.org/jira/browse/HDFS-6382
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client, namenode
> Affects Versions: 2.4.0
> Reporter: Zesheng Wu
> Assignee: Zesheng Wu
>
> In production environments, we often have a scenario like this: we want to
> back up files on HDFS for some period of time and then delete them
> automatically. For example, we keep only 1 day's logs on the local disk due
> to limited disk space, but we need to keep about 1 month's logs in order to
> debug problems, so we keep all the logs on HDFS and delete logs that are
> older than 1 month. This is a typical scenario for an HDFS TTL, so we
> propose that HDFS support TTL.
> The following are some details of this proposal:
> 1. HDFS can support a TTL on a specified file or directory
> 2. If a TTL is set on a file, the file will be deleted automatically after
> the TTL expires
> 3. If a TTL is set on a directory, its child files and directories will be
> deleted automatically after the TTL expires
> 4. A child file/directory's TTL configuration should override its parent
> directory's
> 5. A global configuration option is needed to control whether deleted
> files/directories go to the trash or not
> 6. A global configuration option is needed to control whether a directory
> with a TTL should itself be deleted when it is emptied by the TTL mechanism.