Hello hadoop mailing list. I'm an intern at a software company somewhere that's been tasked with adding file permissions to hadoop. I've begun a discussion with Doug Cutting about how to accomplish that, and he suggested that I move it to the mailing list.
So here it is. If you have any suggestions about reasonable ways to implement this, feel free to chime in. Excuse the poor formatting as well; I had to add some stuff back in for completeness.

Date: Apr 18, 2007 4:48 PM
Subject: Re: hadoop file permissions
To: Doug Cutting <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]

Comments in line:

On 4/18/07, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Kurtis Heimerl wrote:
>> So I thought I'd throw my rough design idea in front of you as soon as
>> possible. Once we decide it's in the ballpark, I'll push it to the
>> general community.
>>
>> So, I see this split into two separate problems. First is the
>> authorization. I agree that kerberos is the way to do that. This will
>> authorize a subject, allowing us to get their user name.
>>
>> Following this, we have the problem of securing the system. The way I
>> understand it should work is that we take the user name discovered above
>> and look up the UID and GID for that user on the local machine. We then
>> store this with the file, probably adding metadata to the namenode.
>>
>> So, my plan is to implement the second part, assuming for now that
>> whatever user name the client sends is valid. I'll leave the
>> authentication of that until I've completed the FS work. Assuming I have
>> time, I'll then set up the kerberos part. The discussions I've had with
>> people indicate that it's an extremely difficult problem.
>
> That sounds like a fine approach to me.
Good.
>> The split seems to happen at DFSClient.java. It's there that we actually
>> call the namenode, seemingly via RPC calls. I'll modify this to send the
>> $USERNAME variable for now, and then set up the file system to use that
>> information.
>
> Yes, DFSClient will need to pass the user to the namenode. Perhaps the
> username should be put in the FileSystem's URI. So an HDFS URI would
> become hdfs://[EMAIL PROTECTED]:5555/foo/bar. URIs without a username
> would have "other" access (typically read-only).
That's reasonable. I don't know how kerberos plays with that though.
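For what it's worth, pulling the username back out of a URI like that is easy with plain java.net.URI. Here's a rough sketch (the user, host, and port are made up for illustration; this isn't actual DFSClient code):

import java.net.URI;

public class HdfsUriUser {
    public static void main(String[] args) {
        // A made-up HDFS URI carrying a username, as Doug suggests above.
        URI uri = URI.create("hdfs://alice@namenode.example.com:5555/foo/bar");

        String user = uri.getUserInfo(); // "alice", or null if no username
        String host = uri.getHost();     // "namenode.example.com"
        int port = uri.getPort();        // 5555
        String path = uri.getPath();     // "/foo/bar"

        // With no username in the URI, fall back to "other" (read-only) access.
        if (user == null) {
            user = "other";
        }
        System.out.println(user + " -> " + host + ":" + port + path);
    }
}

Something along those lines could run in the client before it sends the user name to the namenode.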
>> This will require all people making calls to namenode to have accounts
>> on the namenode box.
>
> No, since we're not checking usernames in the client (anyone can set that
> environment variable) there's no reason to validate them server-side
> either, is there? We should have an equivalent of /etc/groups in the
> namenode.
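On the /etc/groups point, a first cut could be a small text file that the namenode loads at startup. This is only a sketch with a made-up format (group:member,member,...), not anything that exists in hadoop today:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class GroupFile {
    // Maps group name -> set of member user names.
    private final Map<String, Set<String>> groups =
        new HashMap<String, Set<String>>();

    // Expects lines like "staff:alice,bob,carol"; '#' starts a comment.
    public void load(String path) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.length() == 0 || line.startsWith("#")) {
                    continue;
                }
                String[] parts = line.split(":", 2);
                if (parts.length != 2) {
                    continue;
                }
                Set<String> members =
                    new HashSet<String>(Arrays.asList(parts[1].split(",")));
                groups.put(parts[0], members);
            }
        } finally {
            in.close();
        }
    }

    // True if the named user belongs to the named group.
    public boolean isMember(String user, String group) {
        Set<String> members = groups.get(group);
        return members != null && members.contains(user);
    }
}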
Well, it's my understanding that kerberos sends you more than the username; it sends the level of privilege you're currently at. So, if you could change your UID, then you could run as someone else. However, that's totally reasonable; it's not simply changing your $USER environment variable.

So, what I think it does is validate that the user really is [EMAIL PROTECTED] This is the information we get from kerberos. The idea was to take this information and map it to a hadoop user somehow. The obvious way to me was to look up the user on our own machine, but now I realize that is a flawed system. There's a chance kerberos actually validates that it's [EMAIL PROTECTED] If that's the case, then every user will require an account on the server. It would really simplify the design, though, as we could just use the user as the ID in hadoop. This is what I'm looking into, and I haven't made much progress today.
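To make the mapping concrete, this is the kind of thing I have in mind: take whatever principal kerberos hands us and reduce it to a short name that hadoop uses as its internal ID. Purely hypothetical, just simple string handling:

public class PrincipalMapper {
    // Reduces a kerberos principal to a short user name, e.g.
    //   "alice@EXAMPLE.COM"       -> "alice"
    //   "alice/admin@EXAMPLE.COM" -> "alice"
    public static String shortName(String principal) {
        String name = principal;
        int at = name.indexOf('@');
        if (at >= 0) {
            name = name.substring(0, at);    // drop the realm
        }
        int slash = name.indexOf('/');
        if (slash >= 0) {
            name = name.substring(0, slash); // drop any instance component
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(shortName("alice@EXAMPLE.COM"));       // alice
        System.out.println(shortName("alice/admin@EXAMPLE.COM")); // alice
    }
}

Whether a plain short name is enough obviously depends on the account-on-the-server question above.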
>> Also, I'm not entirely sure how to get the UID and GID from the kernel.
>
> We shouldn't need to. HDFS can have its own UID and GID database, or
> simply use strings everywhere. That's a namenode implementation detail.
> For example, there may be no persistent UIDs or GIDs. We might use ints
> in memory to save space, and use these to index tables of strings, but
> always record the strings when persisting namenode data.
>
> Finally, it would be good to move this discussion to the mailing list or
> Jira sooner rather than later.
I'll CC my mailing list account and then forward it there.

> Doug
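One last sketch, on Doug's point about ints in memory but strings on disk: a toy interning table like this (not namenode code, just the shape of the idea) keeps the in-memory ids small while only the strings ever get recorded when the namenode persists its data:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NameTable {
    // Interns user/group names: ints in memory, strings when persisted.
    private final Map<String, Integer> ids = new HashMap<String, Integer>();
    private final List<String> names = new ArrayList<String>();

    // Returns a stable in-memory id for the name, assigning one if it's new.
    public synchronized int idOf(String name) {
        Integer id = ids.get(name);
        if (id == null) {
            id = Integer.valueOf(names.size());
            names.add(name);
            ids.put(name, id);
        }
        return id.intValue();
    }

    // Looks the string back up, e.g. when writing out namenode metadata.
    public synchronized String nameOf(int id) {
        return names.get(id);
    }
}

The ids here are not persistent; a restarted namenode could hand out different ints, which is fine as long as only the strings are ever written out.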