Hi Andy, I can reproduce the problem and I believe it is a bug. The output directory should be owned by the user submitting the job, not by the tasktracker account. Do you want to file a JIRA, or should I do it?
Thanks,
Nicholas

----- Original Message ----
From: Andy Li <[EMAIL PROTECTED]>
To: core-dev@hadoop.apache.org
Sent: Wednesday, February 27, 2008 1:22:26 AM
Subject: Re: mapreduce does the wrong thing with dfs permissions?

I also encountered the same problem when running the MapReduce code under a different user name. For example, assume I installed Hadoop under the account 'hadoop' and I want to run my program under the user account 'test'. I created an input folder /user/test/input/ as user 'test' and set its permissions to 0775:

/user/test/input    <dir>    2008-02-27 01:20    rwxr-xr-x    test    hadoop

When I run the MapReduce code, the output directory I specify is created as user 'hadoop' instead of 'test':

${HADOOP_HOME}/bin/hadoop jar /tmp/test_perm.jar -m 57 -r 3 "/user/test/input/l" "/user/test/output/"

The directory "/user/test/output/" ends up with the following permissions and user:group:

/user/test/output    <dir>    2008-02-27 03:53    rwxr-xr-x    hadoop    hadoop

My question is: why is the output folder owned by the superuser 'hadoop'? Because of that, the MapReduce code cannot write to this folder, since the permissions do not allow user 'test' to write to it. So the output folder was created, but the user account 'test' cannot write anything into it, and the job throws the exception pasted below.

I have been looking for a solution but cannot find an exact answer. How do I set the default umask so that new directories come out as 0775? I can add the user 'test' to the group 'hadoop' so that 'test' has write access to folders belonging to the 'hadoop' group. In other words, as long as a folder is set to 'rwxrwxr-x', user 'test' can read and write to it and share it with 'hadoop:hadoop'. Any idea how I can set or modify the global default umask for Hadoop, or do I have to override the default umask value in my configuration or FileSystem calls every time?
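For reference, here is a minimal sketch of a per-job workaround for the question above. It assumes the 0.16-era client-side umask property is named "dfs.umask" and that FileSystem.mkdirs(Path, FsPermission) and FileSystem.setPermission(Path, FsPermission) are available in the release being run; the class name is made up for illustration, so treat this as a starting point rather than a confirmed fix.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// Hypothetical helper, run as the submitting user ('test') before job submission.
public class OutputDirWorkaround {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Assumption: the client-side umask property in this release is "dfs.umask"
    // (octal). A umask of 002 makes new files and directories group-writable,
    // so directories come out 0775 instead of 0755.
    conf.set("dfs.umask", "002");

    FileSystem fs = FileSystem.get(conf);
    Path output = new Path("/user/test/output");

    // Pre-create the output directory as user 'test' with group-write
    // permission (rwxrwxr-x), so anyone in the shared 'hadoop' group,
    // including the tasktracker account, can write into it.
    if (!fs.exists(output)) {
      fs.mkdirs(output, new FsPermission((short) 0775));
    } else {
      fs.setPermission(output, new FsPermission((short) 0775));
    }
  }
}

This only sidesteps the symptom by leaning on group-write access; the output files still end up owned by the tasktracker account until the ownership bug itself is fixed.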
======= COPY/PASTE STARTS HERE =======
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=test, access=WRITE, inode="_task_200802262256_0007_r_000001_1":hadoop:hadoop:rwxr-xr-x
    at org.apache.hadoop.dfs.PermissionChecker.check(PermissionChecker.java:173)
    at org.apache.hadoop.dfs.PermissionChecker.check(PermissionChecker.java:154)
    at org.apache.hadoop.dfs.PermissionChecker.checkPermission(PermissionChecker.java:102)
    at org.apache.hadoop.dfs.FSNamesystem.checkPermission(FSNamesystem.java:4035)
    at org.apache.hadoop.dfs.FSNamesystem.checkAncestorAccess(FSNamesystem.java:4005)
    at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:963)
    at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:938)
    at org.apache.hadoop.dfs.NameNode.create(NameNode.java:281)
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:899)
    at org.apache.hadoop.ipc.Client.call(Client.java:512)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
    at org.apache.hadoop.dfs.$Proxy1.create(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.dfs.$Proxy1.create(Unknown Source)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:1927)
    at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:382)
    at org.apache.hadoop.dfs.DistributedFileSystem.create(DistributedFileSystem.java:135)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:436)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:336)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:308)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2089)
======= COPY/PASTE ENDS HERE =======

On Tue, Feb 26, 2008 at 9:47 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
> On Feb 26, 2008, at 3:05 PM, Michael Bieniosek wrote:
>
> > Ah, that makes sense.
> >
> > I have things set up this way because I can't trust code that gets
> > run on the tasktrackers: we have to prevent the tasktrackers from
> > eg. sending kill signals to the datanodes. I didn't think about
> > the jobtracker, but I suppose I should equally not trust code that
> > gets run on the jobtracker...
>
> Just to be clear, no user code is run in the JobTracker or
> TaskTracker. User code is only run in the client and task processes.
> However, it makes a lot of sense to run map/reduce as a different
> user than hdfs to prevent the task processes from having access to
> the raw blocks or datanodes.
>
> -- Owen
>