Hi folks,
HDFS has a POSIX-like permission model, using r/w/x bits and owner, group, other for access control. It works well most of the time, with two exceptions:

1. Data shared among users. Groups can be used for access control, but the users have to be in the same group as the data: each group effectively stands for one sharing relationship between users and data. When many sharing relationships exist, there are many groups, which are hard to manage.

2. Hive. Hive uses a table-based access control model: a user can have SELECT, UPDATE, CREATE, or DROP privileges on a table, which translate to r/w permissions in HDFS. However, Hive's table-based authorization doesn't match HDFS's POSIX-like model. For Hive users accessing HDFS, group permissions can be deployed, which again introduces either many groups or big groups bundling many sharing relationships.

Inspired by the way an RDBMS manages its data, directory-level access control based on authorized user impersonation can be implemented as an extension to the POSIX-like permission model. It consists of:

1. ACLFileSystem
2. an authorization manager: holds the access control information and a shared secret with the namenode
3. an authenticator (embedded in the namenode)

Take Hive as an example, and let user DW be the owner of the data. The procedure is:

1. A user submits a Hive query or an HCatalog job that accesses DW's data. From it we can derive the tables/partitions to be read and written, and the corresponding HDFS paths. An RPC call to the authorization manager is then invoked, sending {user, tablename, table_path, w/r}.

2. The authorization manager performs an authorization check to decide whether the access is allowed. If it is, it replies with an encrypted table path: {realuser, encrypted(tablepath+w/r)}. realuser here stands for the owner of the requested data.

3.
ACLFileSystem extends FileSystem. When an open(path) call is invoked, it replaces the path with encrypted(tablepath+w/r) and invokes the namenode RPC call, e.g. open(realuser, encrypted(tablepath+w/r), null). If the user is requesting a partition path, the RPC call can be invoked as open(realuser, encrypted(tablepath+w/r), path_suffix).

4. The namenode picks up the RPC call and decrypts encrypted(tablepath+w/r) with the shared secret to verify that it is not forged. If it is genuine, the namenode checks the w/r operation, joins tablepath and path_suffix, and executes the call as the owner of the path, for example user DW.

A delegation token or something similar can be used as the shared secret, and the authorization manager can be integrated into the Hive metastore.

In general, I propose an HDFS user impersonation mechanism, and an authorization mechanism built on top of it. If the community is interested, I will file a jira for HDFS user impersonation and a jira for the authorization manager soon.

Thoughts?

Thanks a lot
Erik.fang
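P.S. To make steps 2 and 4 concrete, here is a minimal standalone sketch of how the authorization manager could produce encrypted(tablepath+w/r) and how the namenode could verify it with the shared secret. The class name, the AES/ECB cipher choice, and the '+' separator are illustrative assumptions only; a real design would likely reuse the delegation-token machinery mentioned above:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Illustrative sketch only: the authorization manager encrypts
// "tablepath+w/r" under the shared secret (step 2); the namenode decrypts
// it to verify the request is not forged (step 4).
public class SharedSecretToken {
    private final SecretKey sharedSecret;

    public SharedSecretToken(SecretKey sharedSecret) {
        this.sharedSecret = sharedSecret;
    }

    // Authorization manager side: issue encrypted(tablepath+w/r).
    // ECB mode is used here only to keep the sketch short.
    public String issue(String tablePath, char op) throws Exception {
        Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, sharedSecret);
        byte[] ct = c.doFinal(
            (tablePath + "+" + op).getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(ct);
    }

    // Namenode side: decrypt with the shared secret and split back into
    // {tablepath, w/r}. A forged or tampered token fails to decrypt.
    public String[] verify(String token) throws Exception {
        Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, sharedSecret);
        String plain = new String(
            c.doFinal(Base64.getDecoder().decode(token)),
            StandardCharsets.UTF_8);
        int sep = plain.lastIndexOf('+');
        return new String[] {
            plain.substring(0, sep), plain.substring(sep + 1) };
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        SharedSecretToken t = new SharedSecretToken(key);
        String token = t.issue("/user/DW/warehouse/sales", 'r');
        String[] out = t.verify(token);
        System.out.println(out[0] + " " + out[1]);
        // prints: /user/DW/warehouse/sales r
    }
}
```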
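The path rewriting in step 3 can also be sketched without depending on the Hadoop FileSystem API. The class name and the tablepath/path_suffix split below are hypothetical, not the actual ACLFileSystem interface:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of step 3: given an authorized table path and the
// token obtained from the authorization manager in step 2, split a
// requested path into tablepath + path_suffix and assemble the arguments
// {realuser, encrypted(tablepath+w/r), path_suffix} for the namenode RPC.
public class PathRewriter {
    // tablepath -> encrypted(tablepath+w/r) token
    private final Map<String, String> tokens = new HashMap<>();
    private final String realUser; // owner of the data, e.g. "DW"

    public PathRewriter(String realUser) {
        this.realUser = realUser;
    }

    public void addToken(String tablePath, String encryptedToken) {
        tokens.put(tablePath, encryptedToken);
    }

    // Returns {realuser, token, path_suffix}; path_suffix is null when
    // the requested path is the table path itself, mirroring
    // open(realuser, encrypted(tablepath+w/r), null).
    public String[] rewrite(String requestedPath) {
        for (Map.Entry<String, String> e : tokens.entrySet()) {
            String tablePath = e.getKey();
            if (requestedPath.equals(tablePath)) {
                return new String[] { realUser, e.getValue(), null };
            }
            if (requestedPath.startsWith(tablePath + "/")) {
                String suffix = requestedPath.substring(tablePath.length());
                return new String[] { realUser, e.getValue(), suffix };
            }
        }
        throw new IllegalArgumentException(
            "no authorization token for " + requestedPath);
    }
}
```

For example, with a token registered for /user/DW/warehouse/sales, a request for /user/DW/warehouse/sales/dt=2013-01-01 would be rewritten to {"DW", token, "/dt=2013-01-01"}, and the namenode would join tablepath and path_suffix back together after verifying the token.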