[ https://issues.apache.org/jira/browse/MAPREDUCE-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978551#action_12978551 ]

Greg Roelofs commented on MAPREDUCE-2238:
-----------------------------------------

bq. I guess it could be a test timing out right as a setPermissions is done, interrupting in the middle... but seems pretty unlikely, don't you think?

Yes.  I'm guessing it's more subtle than that and lies within the core MR code 
or the JVM.  The fact that I see it semi-frequently on NFS (that is, more 
frequently than on Hudson or in production) suggests either timing (NFS is 
slow), perhaps via an erroneous assumption of synchronous behavior, or else an 
erroneous assumption of an infallible system call.  It could be other things 
as well, of course, but those seem to me like the most probable candidates.
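
To make that second guess concrete: code along the following lines, which 
discards the boolean result of File.delete(), looks fine on a local disk but 
can silently leave state behind on a slow or flaky NFS mount.  This is a 
hypothetical illustration, not a quote from the MR code; cleanup() and 
attemptDir are invented names.

{code:java}
import java.io.File;
import java.io.IOException;

public class DeleteCheck {
    // 'attemptDir' is a hypothetical task-attempt log directory,
    // named only for illustration.
    static void cleanup(File attemptDir) throws IOException {
        File log = new File(attemptDir, "syslog");

        // The fragile pattern: the boolean result of delete() is
        // silently discarded, so a failure goes unnoticed until
        // something else (e.g. a later cleanup pass) breaks:
        //     log.delete();

        // Checking and surfacing the result instead:
        if (log.exists() && !log.delete()) {
            throw new IOException("Unable to delete " + log);
        }
    }
}
{code}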

bq. I agree we could work around it for the tests, but I'm nervous whether we will see this issue crop up in production. Have you guys at Yahoo seen this on any clusters running secure YDH?

To clarify, I was suggesting working around it in the MR code itself, not 
realizing that the Hudson backtrace wasn't using MR code at all.  (Well, 
apparently.)  So I'm not sure where that leaves us, other than trying to fix 
the actual set-permissions problem.  Seems like no one's basic 
deleteRecursive() implementation includes an option to attempt a chmod() before 
failing on bad permissions?
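
For what it's worth, here's a minimal sketch of what such an option could 
look like, using java.io.File's setWritable()/setExecutable() as a stand-in 
for chmod().  Hypothetical code, not anything in the Hadoop or Hudson tree:

{code:java}
import java.io.File;
import java.io.IOException;

public class RecursiveDelete {
    /**
     * Recursively delete 'f', attempting a chmod()-style fix-up
     * before giving up on a permission failure.  Illustrative only.
     */
    public static void deleteRecursive(File f) throws IOException {
        if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children == null) {
                // listFiles() returns null when the directory is
                // unreadable/unsearchable; fix the bits and retry once.
                f.setReadable(true);
                f.setWritable(true);
                f.setExecutable(true);
                children = f.listFiles();
            }
            if (children != null) {
                for (File child : children) {
                    deleteRecursive(child);
                }
            }
        }
        if (!f.delete()) {
            // On POSIX, delete permission lives in the parent
            // directory, so chmod the parent and retry once.
            File parent = f.getParentFile();
            if (parent != null) {
                parent.setWritable(true);
                parent.setExecutable(true);
            }
            if (!f.delete()) {
                throw new IOException("Unable to delete " + f);
            }
        }
    }
}
{code}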

Anyway, yes, I _think_ we've seen it in production with 0.20S or later, but it 
wasn't while I was on call, so I might be remembering a different issue with 
similar symptoms.  Sorry...there are lots of interesting failure modes in 
Hadoop, and my memory is finite. :-)


> Undeletable build directories 
> ------------------------------
>
>                 Key: MAPREDUCE-2238
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2238
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: build, test
>    Affects Versions: 0.23.0
>            Reporter: Eli Collins
>
> The MR hudson job is failing, looks like it's due to a test chmod'ing a build 
> directory so the checkout can't clean the build dir.
> https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk/549/console
> Building remotely on hadoop7
> hudson.util.IOException2: remote file operation failed: /grid/0/hudson/hudson-slave/workspace/Hadoop-Mapreduce-trunk at hudson.remoting.chan...@2545938c:hadoop7
>       at hudson.FilePath.act(FilePath.java:749)
>       at hudson.FilePath.act(FilePath.java:735)
>       at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:589)
>       at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:537)
>       at hudson.model.AbstractProject.checkout(AbstractProject.java:1116)
>       at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:479)
>       at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:411)
>       at hudson.model.Run.run(Run.java:1324)
>       at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
>       at hudson.model.ResourceController.execute(ResourceController.java:88)
>       at hudson.model.Executor.run(Executor.java:139)
> Caused by: java.io.IOException: Unable to delete /grid/0/hudson/hudson-slave/workspace/Hadoop-Mapreduce-trunk/trunk/build/test/logs/userlogs/job_20101230131139886_0001/attempt_20101230131139886_0001_m_000000_0
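
For anyone trying to reproduce the failure mode quoted above: the following 
stand-alone sketch shows how a single chmod on a job-log directory leaves its 
children undeletable for a later non-root cleanup pass.  The paths are 
invented stand-ins for build/test/logs/userlogs/...; this is an illustration, 
not the failing test itself.

{code:java}
import java.io.File;
import java.io.IOException;

public class UndeletableDir {
    public static void main(String[] args) throws IOException {
        // Invented stand-ins for the job/attempt log directories:
        File job = new File("userlogs/job_0001");
        File attempt = new File(job, "attempt_0001_m_000000_0");
        if (!attempt.mkdirs()) {
            throw new IOException("setup failed: " + attempt);
        }

        // A test that drops the write bit on the job directory
        // (as a chmod would) leaves its children undeletable:
        job.setWritable(false);

        // This is the delete that Hudson's workspace cleanup hits
        // (it fails for any non-root user on POSIX):
        if (!attempt.delete()) {
            System.err.println("Unable to delete " + attempt);
        }

        // Restore the bit so the workspace can actually be cleaned:
        job.setWritable(true);
        attempt.delete();
        job.delete();
    }
}
{code}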

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
