[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012008#comment-14012008 ]

Hangjun Ye commented on HDFS-6382:
----------------------------------

Thanks Chris and Colin for your valuable comments. I'd like to address your 
concerns about the "security" problem.

First, our scenario is as follows:
We have a Hadoop cluster shared by multiple teams for their storage and 
computation requirements, and "we" are the dev/support team that ensures the 
functionality and availability of the cluster. The cluster is security-enabled 
so that each team can only access the files it is permitted to. Every team is 
thus a common user of the cluster, and "we" hold the superuser account.

Currently several teams need to clean up files based on a TTL policy. 
Obviously each of them could run a cron job to do that themselves, but that 
would duplicate a lot of work, so it would be better to have a mechanism that 
lets them specify/implement their policy easily.

One approach, as you suggested, is that we implement a separate cleanup 
platform: users submit their policies to this platform, and we perform the 
actual cleanup on HDFS on their behalf (as a superuser or other privileged 
user). But this separate platform would have to implement its own 
authentication/authorization mechanism to make sure users are who they claim 
to be and have the required permissions (authentication is a must; 
authorization might be optional, but it would be better to have it). That 
duplicates work the NameNode already does with Kerberos/ACLs.

If it's implemented inside the NameNode, we could leverage the NameNode's own 
authentication/authorization mechanism. For example, we could provide a 
"./bin/hdfs dfs -setttl <path/file>" command (just like -setrep). Users would 
specify their policy with it, and the NameNode would persist it somewhere, 
perhaps as a file attribute like the replication factor. The mechanism inside 
the NameNode would then (perhaps periodically) execute all policies specified 
by users, and it could safely do so as superuser, since 
authentication/authorization already happened when the users set their 
policies on the NameNode.
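
Just to make the idea concrete, here is a minimal sketch of the client side, 
assuming the TTL were persisted as an extended attribute. The attribute name 
"user.ttl" is invented for illustration, and xattrs only appeared in Hadoop 
around 2.5, so this is an assumption, not a committed design:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TtlClient {
      // Sketch only: "user.ttl" is an assumed attribute name.
      public static void setTtl(FileSystem fs, Path path, long ttlMillis)
          throws IOException {
        // The NameNode authenticates/authorizes this call like any other
        // metadata operation, then persists the attribute with the file.
        fs.setXAttr(path, "user.ttl",
            Long.toString(ttlMillis).getBytes("UTF-8"));
      }
    }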

To address the specific concerns you raised:
* "buggy or malicious code": The proposed concept (actually Haohui proposed) 
should be pretty similar to HBase's coprocessor 
(http://hbase.apache.org/book.html#cp), it's a plug-in or extension of NameNode 
and most likely enabled at deployment time. A common user can't submit it, the 
cluster owner could do. So the code is not arbitrary and the quality/safety 
could be guaranteed.

* "Who exactly is the effective user running the delete, and how do we manage 
their login and file permission enforcement": the extension is run as 
superuser/system, a specific extension implementation could do any permission 
enforcement if needed. For the "TTL-based cleanup policy executor", no 
permission enforcement is needed at this stage as authentication/authorization 
have been done when user set policy.
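
For illustration only, here is a rough sketch of what such a deployment-time 
extension contract and the TTL executor could look like. Everything here (the 
interface, the "user.ttl" attribute, and going through the FileSystem client 
API instead of walking the namespace directly) is invented for the sketch; 
nothing like this exists in the NameNode today:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical contract, loosely modeled on HBase coprocessors. The
    // implementing class would be named in hdfs-site.xml by the cluster
    // owner, so common users cannot inject code.
    interface NameNodeExtension {
      void initialize(Configuration conf);
      void run() throws IOException; // invoked periodically as superuser
      void shutdown();
    }

    class TtlExecutor implements NameNodeExtension {
      private static final String TTL_XATTR = "user.ttl"; // assumed name
      private FileSystem fs;

      public void initialize(Configuration conf) {
        try {
          fs = FileSystem.get(conf);
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      // No permission check here: authentication/authorization already
      // happened when the user set the policy on the NameNode.
      public void run() throws IOException {
        scan(new Path("/"));
      }

      // Delete every file whose modification time plus TTL has passed.
      private void scan(Path dir) throws IOException {
        for (FileStatus st : fs.listStatus(dir)) {
          if (st.isDirectory()) {
            scan(st.getPath());
            continue;
          }
          Map<String, byte[]> xattrs = fs.getXAttrs(st.getPath());
          byte[] raw = xattrs.get(TTL_XATTR);
          if (raw == null) {
            continue; // no TTL policy on this file
          }
          long ttlMillis = Long.parseLong(new String(raw, "UTF-8"));
          if (st.getModificationTime() + ttlMillis
              < System.currentTimeMillis()) {
            fs.delete(st.getPath(), false);
          }
        }
      }

      public void shutdown() { }
    }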

I think the idea Haohui proposed is to have an extensible mechanism in the 
NameNode for running jobs that depend heavily on namespace data, while 
keeping the specific job's code as decoupled from the NameNode's core code as 
possible. It's certainly not easy; as Chris pointed out, there are several 
problems such as HA and concurrency, but it may deserve consideration.

> HDFS File/Directory TTL
> -----------------------
>
>                 Key: HDFS-6382
>                 URL: https://issues.apache.org/jira/browse/HDFS-6382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, namenode
>    Affects Versions: 2.4.0
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>
> In production environments, we often have a scenario like this: we want to 
> back up files on HDFS for some time and then delete these files 
> automatically. For example, we keep only 1 day's logs on local disk due to 
> limited disk space, but we need to keep about 1 month's logs in order to 
> debug program bugs, so we keep all the logs on HDFS and delete logs that 
> are older than 1 month. This is a typical HDFS TTL scenario, so we propose 
> that HDFS support TTL.
> Following are some details of this proposal:
> 1. HDFS can support TTL on a specified file or directory
> 2. If a TTL is set on a file, the file will be deleted automatically after 
> the TTL expires
> 3. If a TTL is set on a directory, the child files and directories will be 
> deleted automatically after the TTL expires
> 4. The child file/directory's TTL configuration should override its parent 
> directory's
> 5. A global configuration is needed to control whether the deleted 
> files/directories should go to the trash or not
> 6. A global configuration is needed to control whether a directory with TTL 
> should be deleted when it is emptied by the TTL mechanism or not.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
