[
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026047#comment-14026047
]
Zesheng Wu commented on HDFS-6382:
----------------------------------
Thanks [~cmccabe] for your feedback.
bq. For the MR strategy, it seems like this could be parallelized fairly
easily. For example, if you have 5 MR tasks, you can calculate the hash of each
path, and then task 1 can do all the paths that are 0 mod 5, task 2 can do all
the paths that are 1 mod 5, and so forth. MR also doesn't introduce extra
dependencies since HDFS and MR are packaged together.
Do you mean that we first scan the whole namespace and then split it into 5
pieces according to the hash of each path? Why don't we just complete the work
during the first scan? If I have misunderstood your meaning, please point it out.
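For illustration, the hash-based split described above might look like the
following sketch. The class and method names here are made up for the example,
not taken from any patch; each of N tasks would claim the paths whose hash
falls into its bucket.

```java
import java.util.Arrays;
import java.util.List;

public class PathPartitioner {
    // Assign a path to one of numTasks buckets by hashing it.
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int bucketOf(String path, int numTasks) {
        return Math.floorMod(path.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/logs/a", "/logs/b", "/backup/c");
        int numTasks = 5;
        for (String p : paths) {
            // Task i processes exactly the paths where bucketOf(p, numTasks) == i.
            System.out.println(p + " -> task " + bucketOf(p, numTasks));
        }
    }
}
```

Because the bucket is a pure function of the path, no coordination between the
tasks is needed: every task can scan the same listing and skip paths outside
its bucket.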
bq. I don't understand what you mean by "the mapreduce strategy will have
additional overheads." What overheads are you foreseeing?
Possible overheads: starting a MapReduce job requires splitting the input,
starting an AppMaster, and collecting results from arbitrary machines.
(Perhaps 'overheads' is not the right word here.)
bq. I don't understand what you mean by this. What will be done automatically?
Here "automatically" means that we do not have to rely on external tools; the
daemon itself can manage the work well.
bq. How are you going to implement HA for the standalone daemon?
Good point. As you suggested, one approach is to save the state in HDFS and
simply restart the daemon when it fails. But managing the state is complex
work, and I am considering how to simplify this. One possible simpler approach
is to treat the daemon as stateless and simply restart it when it fails: we
needn't checkpoint, and just scan from the beginning on restart. Because we
can require that the work the daemon does is idempotent, starting over from
the beginning is harmless. Possible drawbacks of the latter approach are that
it may waste some time and delay the work, but these are acceptable.
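The stateless variant could be sketched roughly as below. All names here are
hypothetical (not from the design doc); the point is only that a restarted
daemon can safely re-run the full scan, because deleting an already-deleted
expired path is a no-op.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TtlScanner {
    // path -> expiry time in millis; stands in for TTL metadata in the namespace.
    final Map<String, Long> expiryByPath = new HashMap<>();
    // Record of performed deletions; stands in for the actual filesystem state.
    final Set<String> deleted = new HashSet<>();

    // One full pass over the namespace. There is no checkpoint to load:
    // every run, including one after a crash, starts from the full listing.
    void scanOnce(long now) {
        for (Map.Entry<String, Long> e : expiryByPath.entrySet()) {
            if (e.getValue() <= now) {
                // Idempotent: adding an already-present path is a no-op,
                // just as deleting an already-deleted file would be.
                deleted.add(e.getKey());
            }
        }
    }
}
```

Running `scanOnce` twice (simulating a crash and restart between passes)
produces the same set of deletions as running it once.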
bq. I don't see a lot of discussion of logging and monitoring in general. How
is the user going to become aware that a file was deleted because of a TTL? Or
if there is an error during the delete, how will the user know?
For simplicity, in the initial version we will use logs to record which
files/directories are deleted by TTL, as well as any errors during the
deletion process.
bq. Does this need to be an administrator command?
It doesn't need to be an administrator command: users can only setTtl on
files/directories for which they have write permission, and can only getTtl on
files/directories for which they have read permission.
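As an aside, the proposal's rule that a child's TTL configuration overrides
its parent directory's (point 4 in the issue description below) could be
resolved by walking up the path and taking the nearest explicitly set TTL.
A sketch, with illustrative names only:

```java
import java.util.HashMap;
import java.util.Map;

public class TtlResolver {
    // TTLs (in millis) explicitly set on individual paths.
    final Map<String, Long> explicitTtl = new HashMap<>();

    // The nearest ancestor (including the path itself) with a TTL wins.
    Long effectiveTtl(String path) {
        for (String p = path; p != null; p = parentOf(p)) {
            Long ttl = explicitTtl.get(p);
            if (ttl != null) {
                return ttl;
            }
        }
        return null; // no TTL anywhere up the tree
    }

    // "/logs/app/file" -> "/logs/app", "/logs" -> "/", "/" -> null.
    static String parentOf(String path) {
        if (path.equals("/")) {
            return null;
        }
        int i = path.lastIndexOf('/');
        return i == 0 ? "/" : path.substring(0, i);
    }
}
```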
> HDFS File/Directory TTL
> -----------------------
>
> Key: HDFS-6382
> URL: https://issues.apache.org/jira/browse/HDFS-6382
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client, namenode
> Affects Versions: 2.4.0
> Reporter: Zesheng Wu
> Assignee: Zesheng Wu
> Attachments: HDFS-TTL-Design.pdf
>
>
> In production environments, we often have scenarios like this: we want to
> back up files on HDFS for some time and then delete them automatically. For
> example, we keep only 1 day's logs on local disk due to limited disk space,
> but we need to keep about 1 month's logs in order to debug program bugs, so
> we keep all the logs on HDFS and delete logs that are older than 1 month.
> This is a typical scenario for an HDFS TTL, so we propose that HDFS support
> TTL.
> Following are some details of this proposal:
> 1. HDFS can support a TTL on a specified file or directory
> 2. If a TTL is set on a file, the file will be deleted automatically after
> the TTL expires
> 3. If a TTL is set on a directory, its child files and directories will be
> deleted automatically after the TTL expires
> 4. A child file/directory's TTL configuration should override its parent
> directory's
> 5. A global configuration is needed to decide whether the deleted
> files/directories should go to the trash or not
> 6. A global configuration is needed to decide whether a directory with a
> TTL should be deleted when it is emptied by the TTL mechanism.
--
This message was sent by Atlassian JIRA
(v6.2#6252)