[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025585#comment-14025585 ]

Colin Patrick McCabe commented on HDFS-6382:
--------------------------------------------

For the MR strategy, it seems like this could be parallelized fairly easily.  
For example, if you have 5 MR tasks, you can calculate the hash of each path, 
and then task 1 can do all the paths that are 0 mod 5, task 2 can do all the 
paths that are 1 mod 5, and so forth.  MR also doesn't introduce extra 
dependencies since HDFS and MR are packaged together.
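
To make the partitioning concrete, here is a minimal sketch in plain Java 
(class and method names invented for illustration): each task keeps only the 
paths whose hash falls in its own residue class, so the tasks cover the 
namespace without any coordination.

{code:java}
import java.util.Arrays;
import java.util.List;

// Illustrative only: split a path listing across N tasks by hash residue.
public class PathPartition {
  // Returns true if the task with this id should handle the given path.
  static boolean owns(String path, int taskId, int numTasks) {
    // hashCode() can be negative, so force the residue into [0, numTasks).
    int bucket = ((path.hashCode() % numTasks) + numTasks) % numTasks;
    return bucket == taskId;
  }

  public static void main(String[] args) {
    List<String> paths = Arrays.asList("/backup/a.log", "/backup/b.log", "/tmp/c");
    int numTasks = 5;
    for (int task = 0; task < numTasks; task++) {
      for (String path : paths) {
        if (owns(path, task, numTasks)) {
          System.out.println("task " + task + " handles " + path);
        }
      }
    }
  }
}
{code}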

I don't understand what you mean by "the mapreduce strategy will have 
additional overheads."  What overheads are you foreseeing?

It is true that you need to avoid overloading the NameNode.  But this is a 
concern with any approach, not just the MR one.  It would be good to see a 
section on this.  I think the simplest way to do it is to rate-limit RPCs to 
the NameNode to a configurable rate.
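
As one possible shape for that section, a sketch that assumes Guava's 
RateLimiter is available on the classpath (the class name and the 
permits-per-second parameter are illustrative, not a proposal for specific 
config keys):

{code:java}
import java.io.IOException;

import com.google.common.util.concurrent.RateLimiter;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: every listing RPC first takes a permit from a
// limiter configured with the maximum allowed NameNode RPCs per second.
public class ThrottledLister {
  private final RateLimiter rpcLimiter;

  ThrottledLister(double maxRpcsPerSecond) {
    this.rpcLimiter = RateLimiter.create(maxRpcsPerSecond);
  }

  FileStatus[] list(FileSystem fs, Path dir) throws IOException {
    rpcLimiter.acquire(); // blocks until the rate allows another RPC
    return fs.listStatus(dir);
  }
}
{code}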

bq. \[for the standalone daemon\] The major advantage of this approach is that 
we don’t need any extra work to finish the TTL work, all will be done in the 
daemon automatically. 

I don't understand what you mean by this.  What will be done automatically?

How are you going to implement HA for the standalone daemon?  I suppose if all 
the state is kept in HDFS, you can simply restart it when it fails.  However, 
it seems like you need to checkpoint how far along in the FS you are, so that 
if you die and later get restarted, you don't have to redo the whole FS scan.  
This implies reading directories in alphabetical order, or similar.  You also 
need to somehow record when the last scan was, perhaps in a file in HDFS.
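
To illustrate, a rough sketch of that checkpointing scheme (the checkpoint 
location and class names are invented; the same file could also record when 
the last full scan finished):

{code:java}
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Arrays;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: scan children in sorted order and persist the last
// fully processed path, so a restarted daemon can skip everything up to it.
public class CheckpointedScanner {
  // Invented location; any durable file would do.
  private static final Path CHECKPOINT = new Path("/system/ttl/checkpoint");

  static void scanDir(FileSystem fs, Path dir, String resumeAfter) throws IOException {
    FileStatus[] children = fs.listStatus(dir);
    Arrays.sort(children); // FileStatus orders by path, giving a stable scan order
    for (FileStatus child : children) {
      String path = child.getPath().toUri().getPath();
      if (resumeAfter != null && path.compareTo(resumeAfter) <= 0) {
        continue; // already finished before the restart
      }
      // ... evaluate the TTL here and delete if expired ...
      saveCheckpoint(fs, path);
    }
  }

  // In practice you would batch these writes rather than overwrite per entry.
  static void saveCheckpoint(FileSystem fs, String path) throws IOException {
    FSDataOutputStream out = fs.create(CHECKPOINT, true); // overwrite in place
    try {
      out.write(path.getBytes(Charset.forName("UTF-8")));
    } finally {
      out.close();
    }
  }
}
{code}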

I don't see a lot of discussion of logging and monitoring in general.  How is 
the user going to become aware that a file was deleted because of a TTL?  Or if 
there is an error during the delete, how will the user know?  Logging is one 
choice here.  Creating a file in HDFS is another.
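
To make the logging option concrete, a minimal sketch (class name invented; 
commons-logging as used elsewhere in Hadoop 2.x):

{code:java}
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: one log line per TTL-driven delete, plus an
// explicit error line when the delete fails, so operators can audit both.
public class TtlDeleter {
  private static final Log LOG = LogFactory.getLog(TtlDeleter.class);

  static void deleteExpired(FileSystem fs, Path path) {
    try {
      if (fs.delete(path, true)) {
        LOG.info("TTL expired: deleted " + path);
      } else {
        LOG.warn("TTL expired but delete returned false for " + path);
      }
    } catch (IOException e) {
      LOG.error("TTL delete failed for " + path, e);
    }
  }
}
{code}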

The setTtl command seems reasonable.  Does this need to be an administrator 
command?

> HDFS File/Directory TTL
> -----------------------
>
>                 Key: HDFS-6382
>                 URL: https://issues.apache.org/jira/browse/HDFS-6382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, namenode
>    Affects Versions: 2.4.0
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>         Attachments: HDFS-TTL-Design.pdf
>
>
> In production environments, we often have scenarios like this: we want to 
> back up files on HDFS for some time and then have them deleted 
> automatically. For example, we keep only 1 day's logs on local disk due to 
> limited disk space, but we need to keep about 1 month's logs in order to 
> debug program bugs, so we keep all the logs on HDFS and delete logs that 
> are older than 1 month. This is a typical scenario for HDFS TTL, so we 
> propose that HDFS support TTL.
> Following are some details of this proposal:
> 1. HDFS can support TTL on a specified file or directory
> 2. If a TTL is set on a file, the file will be deleted automatically after 
> the TTL expires
> 3. If a TTL is set on a directory, its child files and directories will be 
> deleted automatically after the TTL expires
> 4. A child file/directory's TTL configuration should override its parent 
> directory's (see the sketch below)
> 5. A global configuration is needed to control whether deleted 
> files/directories go to the trash or not
> 6. A global configuration is needed to control whether a directory with a 
> TTL should itself be deleted once the TTL mechanism has emptied it.
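
The override rule in point 4 reads as nearest-ancestor-wins; a minimal, 
unofficial sketch of that resolution (the in-memory map merely stands in for 
wherever TTLs would actually be stored):

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;

// Illustrative sketch: a path's effective TTL is its own setting if
// present, otherwise the nearest ancestor's.
public class EffectiveTtl {
  static final Map<String, Long> ttlMillisByPath = new HashMap<String, Long>();

  static Long effectiveTtl(Path path) {
    for (Path p = path; p != null; p = p.getParent()) {
      Long ttl = ttlMillisByPath.get(p.toUri().getPath());
      if (ttl != null) {
        return ttl; // closest setting wins, so a child overrides its parent
      }
    }
    return null; // no TTL configured anywhere on the path
  }
}
{code}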



--
This message was sent by Atlassian JIRA
(v6.2#6252)
