[ 
https://issues.apache.org/jira/browse/HADOOP-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455652#comment-13455652
 ] 

Luke Lu commented on HADOOP-8803:
---------------------------------

bq. 1. No, a more restrictive HDFS delegation token and Block Token are used
to do byte-range access control, and the new Block Token can reduce the damage
when the Block Token key is compromised.

Block tokens are ephemeral and expire in a few minutes as the shared DN secret
is refreshed, unlike delegation tokens, which are typically renewed over a
longer period and often stored on local storage. I feel that a delegation
token with embedded authorization data (a la MS' PAC extension in Kerberos)
would be a useful addition, while a block token with byte ranges seems
redundant/overkill to me.
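
To make the PAC-like idea concrete, here is a minimal sketch of a delegation
token identifier that embeds opaque authorization data, assuming Hadoop's
AbstractDelegationTokenIdentifier as the base class; the class name and the
authzData field are made up for illustration, not existing Hadoop code:

{code:java}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier;

// Hypothetical identifier: the NameNode would fill authzData with the
// holder's authorization data (e.g. permitted paths) at issue time. The
// blob is covered by the token's MAC, so a holder cannot alter it.
public class AuthzDelegationTokenIdentifier
    extends AbstractDelegationTokenIdentifier {

  public static final Text KIND = new Text("AUTHZ_DELEGATION_TOKEN");

  private byte[] authzData = new byte[0];

  @Override
  public Text getKind() {
    return KIND;
  }

  public void setAuthzData(byte[] data) {
    this.authzData = data;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    super.write(out);                // owner, renewer, realUser, dates
    out.writeInt(authzData.length);  // then the embedded authorization blob
    out.write(authzData);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    super.readFields(in);
    authzData = new byte[in.readInt()];
    in.readFully(authzData);
  }
}
{code}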

bq. I would like to test those kinds of jobs. Do you guys have any examples of
this kind of code I can try to run?

Any task that opens an HDFS file directly will break with the byte-range
scheme, e.g. TestDFSIO.
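
For example, a minimal sketch of such a direct read, along the lines of what
TestDFSIO-style code does (the path here is illustrative):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectHdfsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Bypasses the InputFormat/RecordReader plumbing entirely, so the
    // task's tokens must authorize this exact read, byte ranges and all.
    FSDataInputStream in =
        fs.open(new Path("/benchmarks/TestDFSIO/io_data/test_io_0"));
    byte[] buf = new byte[4096];
    int n = in.read(buf);
    in.close();
    System.out.println("read " + n + " bytes");
  }
}
{code}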

bq. So for my work, the extra workload only happens when one mapper needs to
access data that lives on more than one datanode. And I don't think that is
always the case.

Replica selection is done on the DFSClient side, so the client gets the block
locations of all the replicas along with their block token (in your case,
tokens). If you don't generate tokens for all the replicas up front, you'll
likely have to make extra RPC calls, which is even worse.
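
Roughly, from the client's point of view (a sketch against the HDFS client
protocol types; the method itself is made up for illustration):

{code:java}
import java.io.IOException;

import org.apache.hadoop.hdfs.protocol.ClientProtocol;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlocks;
import org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier;
import org.apache.hadoop.security.token.Token;

public class ReplicaTokenSketch {
  // One getBlockLocations() call returns, per block, one token plus the
  // locations of every replica the client may fail over to.
  static void show(ClientProtocol namenode, String src, long len)
      throws IOException {
    LocatedBlocks blocks = namenode.getBlockLocations(src, 0, len);
    for (LocatedBlock lb : blocks.getLocatedBlocks()) {
      Token<BlockTokenIdentifier> token = lb.getBlockToken(); // one per block
      DatanodeInfo[] replicas = lb.getLocations(); // all candidate replicas
      // A per-replica (or per-range) token scheme would have to mint a
      // token for every entry in replicas up front, or fall back to extra
      // RPCs whenever the client fails over to another replica.
    }
  }
}
{code}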

bq. Another argument is that sharing the same key across the whole HDFS
cluster is too risky. This overhead is something Hadoop has to pay.

The shared key lives only in DN memory and is constantly refreshed. The risk
comes only from OS/software bugs, and those would compromise unique per-node
keys just as well, in the big scheme of things.
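
For illustration only (this is not Hadoop's actual block token code): with an
HMAC-based token scheme, whoever holds the shared key can mint a valid token
for any block, which is exactly the exposure under discussion:

{code:java}
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class SharedKeyTokenDemo {
  // A compromised node that leaks sharedKey can run this for any blockId
  // it likes; that is why the key lives only in memory and is rolled often.
  static byte[] mintToken(byte[] sharedKey, long blockId, String user)
      throws Exception {
    Mac mac = Mac.getInstance("HmacSHA1");
    mac.init(new SecretKeySpec(sharedKey, "HmacSHA1"));
    String tokenIdentifier = user + ":" + blockId;
    return mac.doFinal(tokenIdentifier.getBytes("UTF-8"));
  }
}
{code}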

bq. If Hadoop is running in a public cloud, the nodes may be running under
different cloud providers, the OS may differ, and the people maintaining those
machines are different.

It's highly unlikely that a single cluster would span multiple providers. A
more likely scenario is a cluster at one provider mirroring to a cluster at
another provider. For cross-provider internet traffic, you should use TLS
anyway if you care about security.

                
> Make Hadoop run more securely in a public cloud environment
> -----------------------------------------------------------
>
>                 Key: HADOOP-8803
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8803
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, ipc, security
>    Affects Versions: 0.20.204.0
>            Reporter: Xianqing Yu
>              Labels: hadoop
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> I am a Ph.D. student at North Carolina State University. I am modifying 
> Hadoop's code (covering most parts of Hadoop, e.g. JobTracker, TaskTracker, 
> NameNode, DataNode) to achieve better security.
>  
> My major goal is to make Hadoop run more securely in a cloud environment, 
> especially a public cloud. To achieve that, I redesigned the current 
> security mechanism to provide the following properties:
> 1. Bring byte-level access control to HDFS. In 0.20.204, HDFS access 
> control is at user or block granularity, e.g. the HDFS Delegation Token 
> only checks whether a certain user can access a file, and the Block Token 
> only proves which block or blocks can be accessed. I make Hadoop enforce 
> byte-granularity access control, so each accessing party, user or task 
> process, can only access the bytes it minimally needs.
> 2. I assume that in a public cloud environment, only the NameNode, 
> secondary NameNode, and JobTracker can be trusted. A large number of 
> DataNodes and TaskTrackers may be compromised, because some of them may be 
> running in less secure environments. So I redesign the security mechanism 
> to minimize the damage an attacker can do.
>  
> a. Redesign the Block Access Token to solve HDFS's widely shared-key 
> problem. In the original Block Access Token design, all HDFS nodes 
> (NameNode and DataNodes) share one master key to generate Block Access 
> Tokens; if one DataNode is compromised, the attacker can obtain the key and 
> generate any Block Access Token he or she wants.
>  
> b. Redesign the HDFS Delegation Token to do fine-grained access control for 
> TaskTrackers and MapReduce task processes on HDFS. 
>  
> In Hadoop 0.20.204, all TaskTrackers can use their Kerberos credentials to 
> access any MapReduce files on HDFS, so they have the same privileges as the 
> JobTracker to read or write tokens, copy job files, etc. However, if one of 
> them is compromised, everything critical in the MapReduce directory (job 
> files, Delegation Tokens) is exposed to the attacker. I solve this by 
> having the JobTracker decide which TaskTracker can access which file in the 
> MapReduce directory on HDFS.
>  
> As for the task process, once it gets an HDFS Delegation Token, it can 
> access everything belonging to the job or user on HDFS. Under my design, it 
> can only access the bytes it needs from HDFS.
>  
> There are some other security improvements, such as the TaskTracker not 
> being able to learn information like the block ID from the Block Token 
> (because it is encrypted in my scheme), and HDFS optionally setting up a 
> secure channel to send data.
>  
> With those features, Hadoop can run much more securely in untrusted 
> environments such as a public cloud. I have already started testing my 
> prototype. I would like to know: is the community interested in my work? Is 
> it worthwhile to contribute to production Hadoop?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
