[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014030#comment-14014030 ]
Colin Patrick McCabe commented on HDFS-6382:
--------------------------------------------
Chris, Andrew, and I have brought up a lot of reasons why this probably doesn't
make sense in the NameNode.
Just to summarize:
* security / correctness concerns: it's easy to make a mistake that could bring
down the NameNode or the entire FS
* non-generality to systems using S3 or another FS in addition to HDFS
* issues with federation (which NN does the cleanup? How do you decide?)
* complexities surrounding our client-side Trash implementation and our
server-side snapshots
* configuration burden on sysadmins
* inability to change the cleanup code without restarting the NameNode
* HA concerns (need to avoid split-brain or lost updates)
* error handling (where do users find out about errors?)
* semantics: disappearing or time-limited files are an unfamiliar API, unlike
the traditional FS APIs we usually implement
Making this pluggable doesn't fix any of those problems, and it adds some more:
* API stability issues (the INode and Feature classes have changed a lot, and
we make no guarantees there)
* CLASSPATH issues (if I want to send an email about a cleanup job with the
FooEmailer library, how do I get that library into the NameNode's CLASSPATH?
How do I avoid jar conflicts?)
The only points I've seen raised in favor of doing this in the NameNode are:
* the NameNode already has an authorization system which this could use.
* HBase has coprocessors which also allow loading arbitrary code.
To the first point, there are lots of other ways to deal with authorization,
such as using YARN (which also has authorization) or configuring the cleanup
using files in HDFS.
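As a rough illustration of the "config files in HDFS" idea: the cleanup policy
could live in an ordinary HDFS file that only admins can write, so HDFS
permissions themselves provide the authorization. This is a minimal sketch;
the path /admin/cleanup.conf and the "path maxAgeSeconds" line format are
assumptions for illustration, not an existing convention.
{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanupPolicy {
  /** Load "path maxAgeSeconds" entries from an admin-writable HDFS file. */
  public static Map<String, Long> load(FileSystem fs) throws Exception {
    Map<String, Long> maxAgeByPath = new HashMap<String, Long>();
    Path policyFile = new Path("/admin/cleanup.conf");  // assumed location
    BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(policyFile), StandardCharsets.UTF_8));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.isEmpty() || line.startsWith("#")) continue;  // skip comments
        String[] parts = line.split("\\s+");
        if (parts.length != 2) continue;                       // skip malformed
        maxAgeByPath.put(parts[0], Long.parseLong(parts[1]));
      }
    } finally {
      in.close();
    }
    return maxAgeByPath;
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    System.out.println(load(fs));
  }
}
{code}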
To the second point, HBase doesn't use coprocessors for cleanup jobs... it uses
them for things like secondary indices, a much better-defined problem. The
functionality you want is not something that should be implemented as a
coprocessor, even if we had those.
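To make the alternative concrete, here is a minimal sketch of a client-side
cleanup tool that could run from cron, Oozie, or a YARN application. It needs
no NameNode changes, works against any Hadoop FileSystem (HDFS, S3, etc.), and
surfaces errors in its own logs rather than the NameNode's. The command-line
arguments are illustrative, not a proposed interface.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TtlScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path dir = new Path(args[0]);                      // e.g. /logs
    long maxAgeMs = Long.parseLong(args[1]) * 1000L;   // TTL in seconds
    long cutoff = System.currentTimeMillis() - maxAgeMs;
    for (FileStatus st : fs.listStatus(dir)) {
      if (st.isFile() && st.getModificationTime() < cutoff) {
        // Honor the cluster's trash settings instead of deleting outright;
        // moveToAppropriateTrash returns false when trash is disabled.
        if (!Trash.moveToAppropriateTrash(fs, st.getPath(), conf)) {
          fs.delete(st.getPath(), false);
        }
      }
    }
  }
}
{code}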
> HDFS File/Directory TTL
> -----------------------
>
> Key: HDFS-6382
> URL: https://issues.apache.org/jira/browse/HDFS-6382
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client, namenode
> Affects Versions: 2.4.0
> Reporter: Zesheng Wu
> Assignee: Zesheng Wu
>
> In a production environment, we often have a scenario like this: we want to
> back up files on HDFS for some time and then delete them automatically. For
> example, we keep only one day's logs on the local disk due to limited disk
> space, but we need to keep about one month's logs in order to debug program
> bugs, so we keep all the logs on HDFS and delete logs that are older than
> one month. This is a typical scenario for HDFS TTL, so here we propose that
> HDFS support TTL.
> Following are some details of this proposal:
> 1. HDFS can support TTL on a specified file or directory
> 2. If a TTL is set on a file, the file will be deleted automatically after
> the TTL expires
> 3. If a TTL is set on a directory, its child files and directories will be
> deleted automatically after the TTL expires
> 4. The child file/directory's TTL configuration should override its parent
> directory's
> 5. A global configuration is needed to control whether the deleted
> files/directories should go to the trash or not
> 6. A global configuration is needed to control whether a directory with a
> TTL should be deleted when the TTL mechanism empties it
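For illustration only: if extended attributes are available (they shipped in
Hadoop 2.5), the proposed per-path TTL could be recorded as an xattr and
enforced by an external scanner like the one sketched above, with a child's
xattr overriding its parent's. The xattr name user.ttl and the seconds-as-text
encoding below are assumptions, not part of this proposal.
{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetTtl {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Keep /logs for 30 days (2592000 s); an external job enforces it later.
    fs.setXAttr(new Path("/logs"),
        "user.ttl", "2592000".getBytes(StandardCharsets.UTF_8));
  }
}
{code}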