Hi folks,

HDFS has a POSIX-like permission model, using r/w/x bits for owner, group, and
other to control access. It works well most of the time, except in the
following cases:

1. Data needs to be shared among users

Groups can be used for access control: the users have to be in the same group
as the data, so the group here stands for the sharing relationship between
users and data. If many sharing relationships exist, there are many groups,
which is hard to manage (a concrete sketch follows this list).

2. Hive

Hive uses a table-based access control model: a user can have SELECT, UPDATE,
CREATE, and DROP privileges on a table, which boil down to read/write
permissions on the table's paths in HDFS. However, Hive's table-based
authorization doesn't match HDFS's POSIX-like model. For Hive users accessing
HDFS, group permissions can be deployed, which again introduces either many
groups or big groups containing many sharing relationships.
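
To make the group-management problem concrete, here is a minimal sketch of how
one sharing relationship is typically expressed with groups today (paths and
group names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class GroupSharingToday {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // One dedicated group per sharing relationship, e.g. "dw_orders_readers";
    // every consumer of this table has to be added to that group out of band.
    Path table = new Path("/user/dw/warehouse/orders");
    fs.setOwner(table, "dw", "dw_orders_readers");            // needs admin rights
    fs.setPermission(table, new FsPermission((short) 0750));  // rwxr-x---

    // Each table shared with a different set of users needs yet another group
    // (dw_clicks_readers, ...), so groups multiply with sharing relationships.
  }
}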

Inspired by the way an RDBMS manages data on behalf of its users,
directory-level access control based on authorized user impersonation can be
implemented as an extension to the POSIX-like permission model.

It consists of:

1. ACLFileSystem: a client-side FileSystem implementation

2. authorization manager: holds access control information and a secret shared
with the namenode

3. authenticator (embedded in the namenode)
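
To make the division of responsibilities concrete, here is a rough sketch of
the authorization manager's client-facing protocol and its reply; all names and
signatures below are hypothetical (ACLFileSystem and the namenode-side check
are sketched under the procedure further down):

// Hypothetical client-facing protocol of the authorization manager (component 2).
interface AuthorizationManagerProtocol {
  // Step 1's request {user, tablename, table_path, w/r}; returns step 2's reply.
  AccessGrant authorize(String user, String tableName, String tablePath, char action);
}

// Step 2's reply: {realuser, encrypted(tablepath + w/r)}.
class AccessGrant {
  String realUser;                 // owner of the requested data
  byte[] encryptedPathAndAction;   // sealed with the secret shared with the namenode
}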

Take Hive as an example, where the owner of the data is user DW. The procedure is:

1. A user submits a Hive query or an HCatalog job that accesses DW's data. From
the query we can determine which tables/partitions are read and which are
written, plus the corresponding HDFS paths. Then an RPC call to the
authorization manager is invoked, sending

{user, tablename, table_path, w/r}
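
For example, a client-side hook in Hive/HCatalog could issue that call roughly
as follows, using the hypothetical protocol sketched above (class and method
names are illustrative):

// Illustrative client-side hook: the compiled Hive/HCatalog plan tells us which
// tables/partitions are read or written and their warehouse paths.
class TableAccessRequester {
  static AccessGrant requestGrant(AuthorizationManagerProtocol authz,
                                  String user, String tableName,
                                  String tablePath, boolean write) {
    char action = write ? 'w' : 'r';
    // Equivalent to sending {user, tablename, table_path, w/r} over RPC.
    return authz.authorize(user, tableName, tablePath, action);
  }
}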

2. The authorization manager performs an authorization check to decide whether
the access is allowed. If it is, it replies with an encrypted table path:

{realuser, encrypted(tablepath+w/r)}

realuser here stands for the owner of the requested data.
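
One way the authorization manager could build that reply is with authenticated
encryption over tablepath plus the action, keyed by the secret shared with the
namenode; this is only a sketch, and the actual token format and key management
are open design questions:

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Sketch only: seal tablepath + action with the secret shared with the namenode
// (assumed here to be a 128/256-bit AES key), so the namenode can both recover
// the path/action and detect a forged or tampered grant.
class GrantCipher {
  static byte[] seal(byte[] sharedSecret, String tablePath, char action) throws Exception {
    byte[] iv = new byte[12];
    new SecureRandom().nextBytes(iv);
    Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
    c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(sharedSecret, "AES"),
           new GCMParameterSpec(128, iv));
    byte[] body = c.doFinal((tablePath + ":" + action).getBytes(StandardCharsets.UTF_8));
    // Prepend the IV so the namenode can decrypt with the same shared secret.
    byte[] token = new byte[iv.length + body.length];
    System.arraycopy(iv, 0, token, 0, iv.length);
    System.arraycopy(body, 0, token, iv.length, body.length);
    return token;
  }
}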

3. ACLFileSystem extends FileSystem. When an open(path) call is invoked, it
replaces the path with encrypted(tablepath+w/r) and invokes the namenode RPC
call, such as

open(realuser, encrypted(tablepath+w/r), null)

If the user is requesting a partition path, the RPC call can be invoked as

open(realuser, encrypted(tablepath+w/r), path_suffix)
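
A minimal sketch of that client-side substitution, reusing the hypothetical
AccessGrant from above; note that openAs stands for the new namenode RPC this
proposal would add, it does not exist today:

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the client-side path substitution in step 3. openAs() stands for the
// proposed namenode RPC open(realuser, encrypted(tablepath+w/r), path_suffix).
public abstract class ACLFileSystem extends FileSystem {
  // Grants obtained from the authorization manager, keyed by table path.
  protected Map<String, AccessGrant> grants;

  @Override
  public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    String requested = f.toUri().getPath();
    // Find the authorized table path that is a prefix of the requested path.
    for (Map.Entry<String, AccessGrant> e : grants.entrySet()) {
      if (requested.startsWith(e.getKey())) {
        String suffix = requested.substring(e.getKey().length());
        AccessGrant g = e.getValue();
        return openAs(g.realUser, g.encryptedPathAndAction,
                      suffix.isEmpty() ? null : suffix, bufferSize);
      }
    }
    throw new IOException("No grant covers " + f);
  }

  // Placeholder for the proposed impersonation RPC; the real signature is TBD.
  protected abstract FSDataInputStream openAs(String realUser, byte[] encryptedGrant,
                                              String pathSuffix, int bufferSize)
      throws IOException;
}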

4. The namenode picks up the RPC call and decrypts encrypted(tablepath+w/r)
with the shared secret to verify that it is not forged. If it checks out, the
namenode verifies the w/r operation, joins the tablepath and path_suffix, and
invokes the call as the path's owner, in this example user DW.
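
On the namenode side, the authenticator could verify and unpack the grant
roughly as below, mirroring the GrantCipher sketch above; how the verified call
is then threaded through the existing permission checker is the real design
work, and the proxy-user helper at the end is just one way to express "run as
the owner":

import java.nio.charset.StandardCharsets;
import java.security.PrivilegedExceptionAction;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch of the namenode-side check for step 4, mirroring GrantCipher above.
class GrantVerifier {
  // Returns {tablepath, "r" or "w"}; decryption fails (bad GCM tag) if the grant
  // was forged or tampered with, which is how "verify that it is not forged" works.
  static String[] unseal(byte[] sharedSecret, byte[] token) throws Exception {
    Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
    c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(sharedSecret, "AES"),
           new GCMParameterSpec(128, token, 0, 12));        // IV is the first 12 bytes
    byte[] plain = c.doFinal(token, 12, token.length - 12);
    String decoded = new String(plain, StandardCharsets.UTF_8);
    int sep = decoded.lastIndexOf(':');
    return new String[] { decoded.substring(0, sep), decoded.substring(sep + 1) };
  }

  // One way to express "invoke the call as the path owner" (e.g. user DW):
  // Hadoop's proxy-user facility. The real change would hook into the namenode's
  // permission checker instead.
  static <T> T runAsOwner(String realUser, PrivilegedExceptionAction<T> op)
      throws Exception {
    UserGroupInformation owner = UserGroupInformation.createProxyUser(
        realUser, UserGroupInformation.getLoginUser());
    return owner.doAs(op);
  }
}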


A delegation token or something similar can be used as the shared secret, and
the authorization manager can be integrated into the Hive metastore.

In general, I propose an HDFS user impersonation mechanism and an authorization
mechanism built on top of it.

If the community is interested, I will file a JIRA for HDFS user impersonation
and another for the authorization manager soon.


Thoughts?

Thanks a lot
Erik.fang
