[
https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012008#comment-14012008
]
Hangjun Ye commented on HDFS-6382:
----------------------------------
Thanks Chris and Colin for your valuable comments. I'd like to address your
concerns about the "security" problem.
First, our scenario is as follows:
We have a Hadoop cluster shared by multiple teams for their storage and
computation needs, and "we" are the dev/support team that ensures the
functionality and availability of the cluster. The cluster has security
enabled so that each team can only access the files it should. So every team
is a regular user of the cluster, and "we" own the superuser.
Several teams currently need to clean up files based on a TTL policy.
Obviously each team could run its own cron job to do that, but that means
many duplicated jobs, so we'd better have a mechanism that lets them
specify/implement their policy easily.
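For illustration only, each team's standalone job would be something like the
following sketch (the directory and TTL here are made up):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// A minimal client-side cleanup job a team could run from cron: delete
// everything under a directory whose modification time is older than a
// fixed TTL. Path and TTL are made-up examples.
public class TtlCleanup {
  public static void main(String[] args) throws Exception {
    long ttlMillis = 30L * 24 * 60 * 60 * 1000; // keep ~1 month of logs
    Path dir = new Path("/user/teamA/logs");    // hypothetical team directory

    FileSystem fs = FileSystem.get(new Configuration());
    long now = System.currentTimeMillis();
    for (FileStatus stat : fs.listStatus(dir)) {
      if (now - stat.getModificationTime() > ttlMillis) {
        fs.delete(stat.getPath(), true); // recursively remove expired entry
      }
    }
  }
}
{code}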
One approach, as you suggested, is that we implement a separate cleanup
platform: users submit their policies to it, and we perform the actual
cleanup on HDFS on their behalf (as a superuser or another powerful user).
But the separate platform would have to implement its own
authentication/authorization mechanism to make sure users are who they claim
to be and have the required permissions (authentication is a must;
authorization might be optional, but it would be better to have it). That
duplicates work the NameNode has already done with Kerberos/ACLs.
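A sketch of what that platform might do per policy, assuming it uses Hadoop's
proxy-user ("doAs") facility so HDFS still enforces the policy owner's file
permissions (the class and method names here are hypothetical):
{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch: the cleanup platform runs as a privileged user but performs the
// delete as the policy owner via a proxy user, so the NameNode still
// enforces the owner's permissions on the path.
public class PlatformCleanup {
  public static void deleteAs(String owner, final Path path) throws Exception {
    UserGroupInformation proxy = UserGroupInformation.createProxyUser(
        owner, UserGroupInformation.getLoginUser());
    proxy.doAs(new PrivilegedExceptionAction<Boolean>() {
      @Override
      public Boolean run() throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        return fs.delete(path, true); // recursive delete as the owner
      }
    });
  }
}
{code}
Even with proxy users (which the NameNode must whitelist via the
hadoop.proxyuser.* settings), the platform still has to authenticate the
caller itself before it can trust the owner name it hands to
createProxyUser, which is exactly the duplication described above.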
If it's implemented inside the NameNode, we could leverage the NameNode's
authentication/authorization mechanism. For example, we could provide a
"./bin/hdfs dfs -setttl <path/file>" command (just like -setrep). Users
would specify their policy with it, and the NameNode would persist it
somewhere, maybe as an attribute of the file like the replication factor.
The mechanism inside the NameNode would (maybe periodically) execute all
policies specified by users, and it could do so safely as a superuser, since
authentication/authorization were already done when the users set their
policies on the NameNode.
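No -setttl exists today, of course. Purely as a sketch of the persistence
side, and assuming extended-attribute support is available, the command
could store the TTL as an xattr on the file (the name "user.ttl" is a
made-up convention):
{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: persist a user's TTL as an extended attribute on the file or
// directory, so it lives in the namespace like the replication factor does.
public class SetTtl {
  public static void setTtl(FileSystem fs, Path path, long ttlMillis)
      throws Exception {
    fs.setXAttr(path, "user.ttl",
        Long.toString(ttlMillis).getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    setTtl(fs, new Path(args[0]), Long.parseLong(args[1]));
  }
}
{code}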
To address several specific concerns you raised:
* "buggy or malicious code": The proposed concept (Haohui actually proposed
it) would be pretty similar to HBase's coprocessors
(http://hbase.apache.org/book.html#cp): a plug-in or extension of the
NameNode, most likely enabled at deployment time. A regular user can't
submit one; only the cluster owner can. So the code is not arbitrary, and
its quality/safety can be vetted (see the interface sketch after this list).
* "Who exactly is the effective user running the delete, and how do we manage
their login and file permission enforcement": the extension is run as
superuser/system, a specific extension implementation could do any permission
enforcement if needed. For the "TTL-based cleanup policy executor", no
permission enforcement is needed at this stage as authentication/authorization
have been done when user set policy.
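To make the coprocessor analogy concrete, the extension point might be
little more than an interface like this. It is entirely hypothetical (no
such API exists in the NameNode today); implementations would be named in
the NameNode's configuration and loaded at startup:
{code:java}
import org.apache.hadoop.conf.Configuration;

// Entirely hypothetical extension point, sketched only to make the
// coprocessor analogy concrete. Implementations would be configured by the
// cluster owner at deployment time, never submitted by regular users.
public interface NameNodePolicyExtension {
  /** Called once at NameNode startup with the NameNode configuration. */
  void initialize(Configuration conf);

  /** Called periodically by the NameNode so the extension can do its work. */
  void execute() throws Exception;

  /** Called at NameNode shutdown. */
  void close();
}
{code}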
I think the idea Haohui proposed is to have an extensible mechanism in the
NameNode for running jobs that depend heavily on namespace data, while
keeping the job-specific code as decoupled from the NameNode's core code as
possible. It's certainly not easy; as Chris pointed out, there are problems
like HA and concurrency, but it might be worth thinking about.
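The "TTL-based cleanup policy executor" would then be one such extension. A
deliberately simplified sketch follows; it reuses the made-up "user.ttl"
xattr from earlier and goes through the FileSystem API, whereas a real
in-NameNode version would walk the namespace directly and have to deal with
HA and locking:
{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical TTL executor plugged into the extension point sketched above.
public class TtlPolicyExecutor implements NameNodePolicyExtension {
  private FileSystem fs;

  @Override
  public void initialize(Configuration conf) {
    try {
      fs = FileSystem.get(conf);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void execute() throws Exception {
    scan(new Path("/"));
  }

  // Recursively scan the namespace; delete any entry whose "user.ttl"
  // xattr has expired relative to its modification time.
  private void scan(Path dir) throws Exception {
    for (FileStatus stat : fs.listStatus(dir)) {
      byte[] raw = fs.getXAttrs(stat.getPath()).get("user.ttl");
      if (raw != null) {
        long ttl = Long.parseLong(new String(raw, StandardCharsets.UTF_8));
        if (System.currentTimeMillis() - stat.getModificationTime() > ttl) {
          fs.delete(stat.getPath(), true); // expired: remove file or subtree
          continue;
        }
      }
      if (stat.isDirectory()) {
        scan(stat.getPath());
      }
    }
  }

  @Override
  public void close() {}
}
{code}
Even this toy version hints at the problems Chris raised: the scan has to
coordinate with HA failover and must not hold namespace locks for long.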
> HDFS File/Directory TTL
> -----------------------
>
> Key: HDFS-6382
> URL: https://issues.apache.org/jira/browse/HDFS-6382
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client, namenode
> Affects Versions: 2.4.0
> Reporter: Zesheng Wu
> Assignee: Zesheng Wu
>
> In production environments we often have a scenario like this: we want to
> keep files on HDFS for some time as a backup and then have them deleted
> automatically. For example, we keep only 1 day's logs on local disk due to
> limited disk space, but we need to keep about 1 month's logs in order to
> debug program bugs, so we keep all the logs on HDFS and delete logs that
> are older than 1 month. This is a typical TTL scenario, so here we propose
> that HDFS support TTL.
> Following are some details of this proposal:
> 1. HDFS can support TTL on a specified file or directory
> 2. If a TTL is set on a file, the file will be deleted automatically after
> the TTL expires
> 3. If a TTL is set on a directory, its child files and directories will be
> deleted automatically after the TTL expires
> 4. A child file/directory's TTL configuration should override its parent
> directory's
> 5. A global configuration is needed to control whether deleted
> files/directories go to the trash or not
> 6. A global configuration is needed to control whether a directory with a
> TTL should itself be deleted once the TTL mechanism has emptied it.
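Points 3 and 4 of the proposal together amount to a nearest-ancestor lookup
for the effective TTL. A minimal sketch, again reusing the made-up
"user.ttl" xattr convention from above:
{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of point 4: the effective TTL of a path is the TTL set on the path
// itself if present, otherwise the nearest ancestor's. Returns -1 when no
// TTL is set anywhere on the ancestor chain.
public class EffectiveTtl {
  public static long effectiveTtl(FileSystem fs, Path path) throws Exception {
    for (Path p = path; p != null; p = p.getParent()) {
      byte[] raw = fs.getXAttrs(p).get("user.ttl");
      if (raw != null) {
        return Long.parseLong(new String(raw, StandardCharsets.UTF_8));
      }
    }
    return -1; // no TTL configured for this path
  }
}
{code}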
--
This message was sent by Atlassian JIRA
(v6.2#6252)