[ https://issues.apache.org/jira/browse/HADOOP-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455420#comment-13455420 ]
Xianqing Yu commented on HADOOP-8803:
-------------------------------------

Hi Owen,

Thanks for your comments. Performance is always a trade-off in security design. I am close to testing performance but have not done so yet; however, I did consider the performance impact while designing the whole thing. One example: the byte-range restriction information is generated in the JobClient, stored in the HDFS Delegation Token, transferred into the Block Access Token, and finally checked at the DataNodes. The check itself is simple: a DataNode only sends out the bytes that the byte range permits. It takes fewer than 5 lines of Java code, and that extra workload is spread across the DataNodes in the cluster. (Hadoop's original security checks are still performed; here I am only talking about the byte-range check.)

The second point you raised is very important. In my current implementation, two parts use the byte-range check. One is the TaskTracker when it accesses the MapReduce directory on HDFS; so far I have not seen any problem there. The other part, as you said, is task execution. If a task needs to access content beyond what its file split defines, that would be a problem. But I think byte-range access control is very important for many job programs, and, just as you suggested, we can include it as an option and leave the choice to Hadoop's users: better security, or easier-to-write code.

Thanks for pointing out the change to the block token protocol; I will take a look at it. That change should be included in 2.0.1, right?

Right, I only fully trust the NameNode and JobTracker. DataNodes and TaskTrackers are in a less secure zone. The main reason is not only productivity but also the potentially large number of DataNodes/TaskTrackers, which increases the attack surface. As for "the datanodes need the ability to be trusted by other datanodes", I decided to leave that work to Kerberos.
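The DataNode-side byte-range check described above (the "fewer than 5 lines" of logic) might look roughly like the following sketch. This is illustrative only, not code from the patch: the names `clamp`, `allowedStart`, and `allowedEnd` are assumptions standing in for whatever fields the modified Block Access Token actually carries.

```java
// Hypothetical sketch of the byte-range check at a DataNode: clamp a
// requested read to the range the token authorizes, or refuse it.
public class ByteRangeCheck {
    /** Returns {offset, length} of the permitted portion of the request. */
    static long[] clamp(long reqOffset, long reqLen,
                        long allowedStart, long allowedEnd) {
        long start = Math.max(reqOffset, allowedStart);
        long end = Math.min(reqOffset + reqLen, allowedEnd);
        if (start >= end) {
            throw new SecurityException("requested range outside token's byte range");
        }
        return new long[] { start, end - start };
    }

    public static void main(String[] args) {
        // Request bytes [100, 300), but the token only allows [150, 400).
        long[] r = clamp(100L, 200L, 150L, 400L);
        System.out.println(r[0] + "," + r[1]); // 150,150
    }
}
```

The per-request cost is a pair of comparisons, which is consistent with the claim that the overhead is negligible and spread across the DataNodes.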
In fact, a DataNode only trusts the NameNode, via Kerberos. All other authentication is done with my Block Token.

> Make Hadoop run more securely in a public cloud environment
> -----------------------------------------------------------
>
>                 Key: HADOOP-8803
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8803
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, ipc, security
>    Affects Versions: 0.20.204.0
>            Reporter: Xianqing Yu
>              Labels: hadoop
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> I am a Ph.D. student at North Carolina State University. I am modifying Hadoop's code (covering most parts of Hadoop, e.g. JobTracker, TaskTracker, NameNode, DataNode) to achieve better security.
>
> My major goal is to make Hadoop run more securely in a cloud environment, especially a public cloud. To achieve that, I redesigned the current security mechanism to obtain the following properties:
>
> 1. Bring byte-level access control to HDFS. In 0.20.204, HDFS access control operates at user or block granularity: the HDFS Delegation Token only checks whether a file can be accessed by a certain user, and the Block Token only proves which block or blocks can be accessed. I make Hadoop capable of byte-granularity access control, so each accessing party, user or task process, can only access the bytes it minimally needs.
>
> 2. I assume that in a public cloud environment, only the NameNode, secondary NameNode, and JobTracker can be trusted. A large number of DataNodes and TaskTrackers may be compromised, because some of them may be running in less secure environments. So I redesigned the security mechanism to minimize the damage an attacker can do.
>
> a. Redesign the Block Access Token to solve HDFS's widely-shared-key problem.
> In the original Block Access Token design, all of HDFS (NameNode and DataNodes) shares one master key to generate Block Access Tokens. If one DataNode is compromised, the attacker can obtain the key and generate any Block Access Token he or she wants.
>
> b. Redesign the HDFS Delegation Token to do fine-grained access control for the TaskTracker and Map-Reduce task processes on HDFS.
>
> In Hadoop 0.20.204, all TaskTrackers can use their Kerberos credentials to access any MapReduce files on HDFS, so they have the same privilege as the JobTracker to read or write tokens, copy job files, etc. However, if one of them is compromised, everything critical in the MapReduce directory (job file, Delegation Token) is exposed to the attacker. I solve this problem by having the JobTracker decide which TaskTracker can access which file in the MapReduce directory on HDFS.
>
> For a task process, once it gets an HDFS Delegation Token it can access everything belonging to its job or user on HDFS. With my design, it can only access the bytes it needs from HDFS.
>
> There are some other security improvements, such as preventing the TaskTracker from learning information like the blockID from the Block Token (because it is encrypted in my scheme), and HDFS can optionally set up a secure channel to send data.
>
> With these features, Hadoop can run much more securely in an uncertain environment such as a public cloud. I have already started to test my prototype. I would like to know whether the community is interested in my work. Is it valuable work to contribute to production Hadoop?
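The widely-shared-key problem described in (a) can be illustrated with a small sketch. This is not Hadoop's actual token code; the class and key names are hypothetical, and it only assumes (as the issue states) that tokens are MACs computed under a single cluster-wide key.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: with one master key shared by every node, any
// compromised holder of that key can mint a token that verifies for
// ANY block, including blocks it was never granted.
public class SharedKeyDemo {
    static byte[] sign(byte[] key, String blockId) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return mac.doFinal(blockId.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        byte[] masterKey = "cluster-wide-secret".getBytes(StandardCharsets.UTF_8);
        // A compromised DataNode forges a token for a block it was never granted...
        byte[] forged = sign(masterKey, "blk_999");
        // ...and it is indistinguishable from what any verifier computes.
        byte[] expected = sign(masterKey, "blk_999");
        System.out.println(Arrays.equals(forged, expected)); // true
    }
}
```

This is why limiting full trust to the NameNode and JobTracker, and not distributing a forgery-capable key to every DataNode, matters for the threat model in the issue.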