On Wed, Jul 18, 2018 at 10:21 AM, Todd Lipcon <[email protected]> wrote:
> On Tue, Jul 17, 2018 at 5:27 PM, Sailesh Mukil <[email protected]> wrote:
>
> > On Tue, Jul 17, 2018 at 2:47 PM, Todd Lipcon <[email protected]> wrote:
> >
> > > Hey folks,
> > >
> > > I'm working on a regression test for IMPALA-7311 and found something
> > > interesting. It appears that in our normal minicluster setup, impalad
> > > runs as the same username as the namenode (namely, the username of the
> > > developer, in my case 'todd').
> > >
> > > This means that the NN treats impala as a superuser, and therefore
> > > doesn't actually enforce permissions. So, tests about the behavior of
> > > Impala on files that it doesn't have access to are somewhat tricky to
> > > write.
> >
> > What kind of files do you specifically mean? Something that the daemon
> > tries to access directly (Eg: keytab file, log files, etc.)? I'm
> > guessing it's not this since you mentioned the NN.
> >
> > Or files that belong to a table/partition in HDFS? If it's this case, we
> > would go through Sentry before accessing files that belong to a table,
> > and access would be determined by Sentry on the "session user" (not the
> > impalad user) before Impala even tries to access HDFS. (Eg:
> > tests/authorization/test_authorization.py)
>
> Right, files on HDFS. I mean that, in cases where Sentry is not enabled or
> set up, and even in some cases where it is set up but not synchronized with
> HDFS, it's possible that the user can point table metadata at files or
> directories that aren't writable to the 'impala' user on HDFS. For example,
> I can do:
>
> CREATE EXTERNAL TABLE foo (...) LOCATION '/user/todd/my-dir';
>
> and it's likely that 'my-dir' is not writable by 'impala' on a real
> cluster. Thus, if I try to insert into it, I get an error because "impala"
> does not have HDFS permissions to access this directory.
>
> Currently, the frontend does some checks here to try to produce a nice
> error. But, those checks are based on cached metadata which could be
> inaccurate. In the case that it's inaccurate, the error will be thrown from
> the backend when it tries to create a file in a non-writable location.
>
> In the minicluster environment, it's impossible to test this case (actual
> permissions enforced by the NN causing an error) because the backend is
> running as an HDFS superuser. That is to say, it has full permissions
> everywhere. That's due to the special case behavior that HDFS has: it
> determines the name of the superuser to be the username that is running the
> NN. Since in the minicluster, both impala and the NN run as 'todd' in my
> case, impala acts as superuser. In a real cluster (even with security
> disabled) impala typically runs as 'impala' whereas the NN runs as 'hdfs'
> and thus impala does not have superuser privileges.

This makes sense, thanks for the explanation. The 'HADOOP_USER_NAME'
approach seems like a good way to go, but as Phil said, might cause issues
with other components (or not).

> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
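
For illustration, here is a minimal Python sketch of the two points discussed above. It models the HDFS rule that the superuser is whichever user started the NN, and shows how a test launcher might set HADOOP_USER_NAME for a child process so that, under simple (non-Kerberos) authentication, the Hadoop client identifies itself as a different user. The function names and the way the environment would be passed to impalad are assumptions, not existing Impala code, and the toy model ignores the configurable HDFS superuser group. Whether other minicluster components tolerate this variable is the open question raised above.

```python
import os


def is_hdfs_superuser(client_user, namenode_user):
    """Toy model: HDFS treats the user that started the NameNode as superuser.

    Deliberately ignores the configurable superuser group.
    """
    return client_user == namenode_user


def env_with_hadoop_user(user, base_env=None):
    """Return a copy of the environment with HADOOP_USER_NAME set.

    Hypothetical helper: the result could be passed as env= to
    subprocess.Popen when starting a daemon for a test.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HADOOP_USER_NAME"] = user
    return env


if __name__ == "__main__":
    # Minicluster: impalad and the NN both run as the developer.
    print(is_hdfs_superuser("todd", "todd"))    # True
    # Real cluster: impalad runs as 'impala', the NN as 'hdfs'.
    print(is_hdfs_superuser("impala", "hdfs"))  # False
    # Environment a launcher might hand to a child process.
    print(env_with_hadoop_user("impala", {})["HADOOP_USER_NAME"])  # impala
```

The helper copies the environment rather than mutating os.environ, so only the process started with the returned mapping would see the overridden user.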
