[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983267#action_12983267
 ] 

Todd Lipcon commented on MAPREDUCE-2238:
----------------------------------------

Spent some time adding logging and looping the tests to figure out this 
problem. I think I have it cracked.

The issue is not multiple threads calling setPermissions() in the same process, 
but rather a case where one thread is calling setPermissions() on the *parent* 
directory of a file on which another thread (actually another entire process) 
is calling setPermissions().

In particular, these two invocations race:

2011-01-18 09:00:40,958 INFO  tasktracker.Localizer (Localizer.java:setPermissions(129)) - Thread[TaskLauncher for MAP tasks,5,main]: About to set permissions on /data/1/todd/cdh/repos/cdh3/hadoop-0.20/build/test/logs/userlogs/job_20110118090037816_0001
java.lang.Exception
  at org.apache.hadoop.mapreduce.server.tasktracker.Localizer$PermissionsHandler.setPermissions(Localizer.java:129)
  at org.apache.hadoop.mapreduce.server.tasktracker.Localizer.initializeJobLogDir(Localizer.java:429)
  at org.apache.hadoop.mapred.TaskTracker.initializeJobLogDir(TaskTracker.java:1072)
  at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:969)
  at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2209)
2011-01-18 09:00:40,985 INFO  tasktracker.Localizer (Localizer.java:setPermissions(129)) - Thread[Thread-213,5,main]: About to set permissions on /data/1/todd/cdh/repos/cdh3/hadoop-0.20/build/test/logs/userlogs/job_20110118090037816_0001/attempt_20110118090037816_0001_m_000005_0
java.lang.Exception
  at org.apache.hadoop.mapreduce.server.tasktracker.Localizer$PermissionsHandler.setPermissions(Localizer.java:129)
  at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:285)
  at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:198)

The above traces are from a 0.20 branch, but I imagine it's the same deal on 
trunk.

The issue is that the top invocation flips the job_<id> directory to mode 000 
momentarily. During that window, the stat/chmod calls on the attempt directory 
fail with EACCES, which can leave the attempt directory with the wrong 
permissions. I have strace output which shows this as well.
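
To illustrate how such a window can open, here is a minimal sketch. It 
assumes (this is not taken from the Localizer source) that the mode is set 
with java.io.File alone, by first clearing the bits for everyone and then 
granting them back to the owner. The class and method names are hypothetical.

```java
import java.io.File;

// Hypothetical illustration only; this is not the actual Localizer code.
class TwoStepChmod {
    // Emulates "chmod 700" using only java.io.File, which toggles one kind
    // of permission per call and has no single "set the whole mode" call.
    static boolean setOwnerOnly(File f) {
        // Step 1: drop read/write/execute for everyone, owner included.
        boolean cleared = f.setReadable(false, false)
                & f.setWritable(false, false)
                & f.setExecutable(false, false);

        // Window: f is mode 000 at this point. If f is a directory, a
        // concurrent stat or chmod by a non-root user on anything beneath
        // it fails with EACCES until step 2 completes.

        // Step 2: grant the bits back to the owner only.
        boolean granted = f.setReadable(true, true)
                & f.setWritable(true, true)
                & f.setExecutable(true, true);
        return cleared & granted;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: TwoStepChmod <path>");
            return;
        }
        System.out.println(setOwnerOnly(new File(args[0])) ? "ok" : "failed");
    }
}
```

The end state is the same as a single chmod 700, but only a single chmod 
system call gets there without passing through an intermediate mode.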

I think we should do away with this Java API nonsense altogether, link in a 
normal chmod call via native code, and fall back to forking chmod by default 
when the native library isn't available.
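
A rough sketch of what the fork path might look like, assuming a chmod 
binary on the PATH. The class and method names are hypothetical and this is 
not a patch, just the general shape:

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper; assumes a POSIX system with `chmod` on the PATH.
class ForkChmod {
    // Sets the mode with one chmod invocation, so the file goes straight
    // from its old mode to the new one with no intermediate state.
    static void chmod(File f, String octalMode)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder("chmod", octalMode, f.getAbsolutePath())
                .inheritIO() // let chmod's error output reach our stderr
                .start();
        int rc = p.waitFor();
        if (rc != 0) {
            throw new IOException("chmod " + octalMode + " " + f
                    + " exited with status " + rc);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: ForkChmod <octal-mode> <path>");
            return;
        }
        chmod(new File(args[1]), args[0]);
    }
}
```

Forking costs a process per call, which is why it only makes sense as the 
fallback when the native call isn't there.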

> Undeletable build directories 
> ------------------------------
>
>                 Key: MAPREDUCE-2238
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2238
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: build, test
>    Affects Versions: 0.23.0
>            Reporter: Eli Collins
>         Attachments: mapreduce-2238.txt
>
>
> The MR hudson job is failing; it looks like it's due to a test chmod'ing a 
> build directory, so the checkout can't clean the build dir.
> https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk/549/console
> Building remotely on hadoop7
> hudson.util.IOException2: remote file operation failed: /grid/0/hudson/hudson-slave/workspace/Hadoop-Mapreduce-trunk at hudson.remoting.Channel@2545938c:hadoop7
>       at hudson.FilePath.act(FilePath.java:749)
>       at hudson.FilePath.act(FilePath.java:735)
>       at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:589)
>       at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:537)
>       at hudson.model.AbstractProject.checkout(AbstractProject.java:1116)
>       at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:479)
>       at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:411)
>       at hudson.model.Run.run(Run.java:1324)
>       at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
>       at hudson.model.ResourceController.execute(ResourceController.java:88)
>       at hudson.model.Executor.run(Executor.java:139)
> Caused by: java.io.IOException: Unable to delete /grid/0/hudson/hudson-slave/workspace/Hadoop-Mapreduce-trunk/trunk/build/test/logs/userlogs/job_20101230131139886_0001/attempt_20101230131139886_0001_m_000000_0

