I filed HDFS-7917 to change the way we simulate disk failures. But I think we still need infrastructure folks to help with the jenkins scripts to clean up the directories left behind today.
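Until the tests are fixed, any such cleanup script essentially has to re-grant owner permissions on the way down the tree and delete on the way up. A minimal sketch of what such a helper could look like (ForceClean is a hypothetical name, not the script Infra actually uses; directories stripped of read permission entirely would additionally need a chmod before traversal, but the 500-perm dirs described below can still be listed):

    import java.io.IOException;
    import java.nio.file.FileVisitResult;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.SimpleFileVisitor;
    import java.nio.file.attribute.BasicFileAttributes;

    /**
     * Hypothetical cleanup helper: walks a test data directory, re-granting
     * owner permissions on the way down and deleting on the way up, so that
     * dirs left at 500 perms by a crashed test no longer break "mvn clean".
     */
    public class ForceClean {
      public static void main(String[] args) throws IOException {
        Files.walkFileTree(Paths.get(args[0]), new SimpleFileVisitor<Path>() {
          @Override
          public FileVisitResult preVisitDirectory(Path dir,
              BasicFileAttributes attrs) {
            // Re-grant owner write/execute so the entries inside can be deleted.
            dir.toFile().setWritable(true);
            dir.toFile().setExecutable(true);
            return FileVisitResult.CONTINUE;
          }

          @Override
          public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
              throws IOException {
            Files.delete(file);
            return FileVisitResult.CONTINUE;
          }

          @Override
          public FileVisitResult postVisitDirectory(Path dir, IOException exc)
              throws IOException {
            if (exc != null) {
              throw exc; // surface traversal errors instead of hiding them
            }
            Files.delete(dir); // directory is empty by now
            return FileVisitResult.CONTINUE;
          }
        });
      }
    }

A Jenkins job could run this on the workspace's target/test/data directory before invoking mvn clean.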
On Fri, Mar 13, 2015 at 1:38 PM, Mai Haohui <ricet...@gmail.com> wrote:
> Any updates on this issue? It seems that all HDFS jenkins builds are
> still failing.
>
> Regards,
> Haohui
>
> On Thu, Mar 12, 2015 at 12:53 AM, Vinayakumar B <vinayakum...@apache.org> wrote:
>> I think the problem started from here.
>>
>> https://builds.apache.org/job/PreCommit-HDFS-Build/9828/testReport/junit/org.apache.hadoop.hdfs.server.datanode/TestDataNodeVolumeFailure/testUnderReplicationAfterVolFailure/
>>
>> As Chris mentioned, TestDataNodeVolumeFailure changes the permissions.
>> But in this run, ReplicationMonitor hit an NPE and got a terminate
>> signal, due to which MiniDFSCluster.shutdown() threw an exception.
>>
>> TestDataNodeVolumeFailure#tearDown() restores the executable
>> permissions only after shutting down the cluster, so in this case,
>> IMO, they were never restored:
>>
>> @After
>> public void tearDown() throws Exception {
>>   if (data_fail != null) {
>>     FileUtil.setWritable(data_fail, true);
>>   }
>>   if (failedDir != null) {
>>     FileUtil.setWritable(failedDir, true);
>>   }
>>   if (cluster != null) {
>>     cluster.shutdown();
>>   }
>>   for (int i = 0; i < 3; i++) {
>>     FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 1)), true);
>>     FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 2)), true);
>>   }
>> }
>>
>> Regards,
>> Vinay
>>
>> On Thu, Mar 12, 2015 at 12:35 PM, Vinayakumar B <vinayakum...@apache.org> wrote:
>>> Looking at the history of these kinds of builds, all of them failed
>>> on node H9.
>>>
>>> I think some uncommitted patch or other created the problem and left
>>> it there.
>>>
>>> Regards,
>>> Vinay
>>>
>>> On Thu, Mar 12, 2015 at 6:16 AM, Sean Busbey <bus...@cloudera.com> wrote:
>>>> You could rely on a destructive git clean call instead of maven to
>>>> do the directory removal.
>>>>
>>>> --
>>>> Sean
>>>>
>>>> On Mar 11, 2015 4:11 PM, "Colin McCabe" <cmcc...@alumni.cmu.edu> wrote:
>>>>> Is there a maven plugin or setting we can use to simply remove
>>>>> directories that have no executable permissions on them? Clearly we
>>>>> have the permission to do this from a technical point of view (since
>>>>> we created the directories as the jenkins user); it's simply that
>>>>> the code refuses to do it.
>>>>>
>>>>> Otherwise I guess we can just fix those tests...
>>>>>
>>>>> Colin
>>>>>
>>>>> On Tue, Mar 10, 2015 at 2:43 PM, Lei Xu <l...@cloudera.com> wrote:
>>>>>> Thanks a lot for looking into HDFS-7722, Chris.
>>>>>>
>>>>>> In HDFS-7722, the TestDataNodeVolumeFailureXXX tests reset data dir
>>>>>> permissions in tearDown(), and TestDataNodeHotSwapVolumes resets
>>>>>> permissions in a finally clause.
>>>>>>
>>>>>> Also, I ran mvn test several times on my machine and all tests
>>>>>> passed.
>>>>>>
>>>>>> However, since DiskChecker#checkDirAccess() starts with an
>>>>>> isDirectory() check:
>>>>>>
>>>>>> private static void checkDirAccess(File dir) throws DiskErrorException {
>>>>>>   if (!dir.isDirectory()) {
>>>>>>     throw new DiskErrorException("Not a directory: " + dir.toString());
>>>>>>   }
>>>>>>
>>>>>>   checkAccessByFileMethods(dir);
>>>>>> }
>>>>>>
>>>>>> one potentially safer alternative is replacing the data dir with a
>>>>>> regular file to simulate disk failures.
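Replacing the data dir with a regular file trips that isDirectory() check without ever touching permissions, so nothing is left behind that mvn clean cannot delete. A minimal sketch of the idea, assuming this is roughly what HDFS-7917 proposes (the class and method names here are illustrative, not the actual patch):

    import java.io.File;
    import java.io.IOException;

    import org.apache.hadoop.fs.FileUtil;

    /**
     * Illustrative sketch only: simulate a volume failure by replacing the
     * data directory with a regular file, so DiskChecker#checkDirAccess()
     * fails its isDirectory() check. Unlike chmod-based simulation, this
     * leaves nothing that "mvn clean" cannot delete.
     */
    public class VolumeFailureSimulator {

      /** Replace dataDir with an empty regular file of the same name. */
      public static void simulateVolumeFailure(File dataDir) throws IOException {
        if (!FileUtil.fullyDelete(dataDir)) {
          throw new IOException("Could not delete " + dataDir);
        }
        if (!dataDir.createNewFile()) {
          throw new IOException("Could not create file at " + dataDir);
        }
      }

      /** Undo the simulation so the volume can be reused or cleaned. */
      public static void restoreVolume(File dataDir) throws IOException {
        if (!dataDir.delete()) {
          throw new IOException("Could not delete " + dataDir);
        }
        if (!dataDir.mkdirs()) {
          throw new IOException("Could not recreate " + dataDir);
        }
      }
    }

A test would call simulateVolumeFailure() where it previously removed permissions, and restoreVolume() in tearDown().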
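Separately, Vinay's analysis above suggests an easy hardening of the quoted tearDown(): restore permissions first and shut the cluster down inside a finally block, so an exception like the ReplicationMonitor NPE cannot skip the restore. A hedged sketch, reusing the same fields (data_fail, failedDir, dataDir, cluster) as the quoted test:

    @After
    public void tearDown() throws Exception {
      try {
        // Restore permissions first, so a failing shutdown cannot leave
        // undeletable directories behind on the Jenkins slave.
        if (data_fail != null) {
          FileUtil.setWritable(data_fail, true);
        }
        if (failedDir != null) {
          FileUtil.setWritable(failedDir, true);
        }
        for (int i = 0; i < 3; i++) {
          FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 1)), true);
          FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 2)), true);
        }
      } finally {
        // Shut down last; even if this throws, permissions are already restored.
        if (cluster != null) {
          cluster.shutdown();
        }
      }
    }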
>>>>>> On Tue, Mar 10, 2015 at 2:19 PM, Chris Nauroth <cnaur...@hortonworks.com> wrote:
>>>>>>> TestDataNodeHotSwapVolumes, TestDataNodeVolumeFailure,
>>>>>>> TestDataNodeVolumeFailureReporting, and
>>>>>>> TestDataNodeVolumeFailureToleration all remove executable
>>>>>>> permissions from directories like the one Colin mentioned to
>>>>>>> simulate disk failures at data nodes. I reviewed the code for all
>>>>>>> of those, and they all appear to be doing the necessary work to
>>>>>>> restore executable permissions at the end of the test. The only
>>>>>>> recent uncommitted patch I've seen that makes changes in these test
>>>>>>> suites is HDFS-7722. That patch still looks fine, though. I don't
>>>>>>> know if there are other uncommitted patches that changed these test
>>>>>>> suites.
>>>>>>>
>>>>>>> I suppose it's also possible that the JUnit process unexpectedly
>>>>>>> died after removing executable permissions but before restoring
>>>>>>> them. That always would have been a weakness of these test suites,
>>>>>>> regardless of any recent changes.
>>>>>>>
>>>>>>> Chris Nauroth
>>>>>>> Hortonworks
>>>>>>> http://hortonworks.com/
>>>>>>>
>>>>>>> On 3/10/15, 1:47 PM, "Aaron T. Myers" <a...@cloudera.com> wrote:
>>>>>>>> Hey Colin,
>>>>>>>>
>>>>>>>> I asked Andrew Bayer, who works with Apache Infra, what's going on
>>>>>>>> with these boxes. He took a look and concluded that some perms are
>>>>>>>> being set in those directories by our unit tests which are
>>>>>>>> precluding those files from getting deleted. He's going to clean
>>>>>>>> up the boxes for us, but we should expect this to keep happening
>>>>>>>> until we can fix the test in question to properly clean up after
>>>>>>>> itself.
>>>>>>>>
>>>>>>>> To help narrow down which commit it was that started this, Andrew
>>>>>>>> sent me this info:
>>>>>>>>
>>>>>>>> "/home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3/
>>>>>>>> has 500 perms, so I'm guessing that's the problem. Been that way
>>>>>>>> since 9:32 UTC on March 5th."
>>>>>>>>
>>>>>>>> --
>>>>>>>> Aaron T. Myers
>>>>>>>> Software Engineer, Cloudera
>>>>>>>>
>>>>>>>> On Tue, Mar 10, 2015 at 1:24 PM, Colin P. McCabe <cmcc...@apache.org> wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> A very quick (and not thorough) survey shows that I can't find
>>>>>>>>> any jenkins jobs that succeeded in the last 24 hours. Most of
>>>>>>>>> them seem to be failing with some variant of this message:
>>>>>>>>>
>>>>>>>>> [ERROR] Failed to execute goal
>>>>>>>>> org.apache.maven.plugins:maven-clean-plugin:2.5:clean (default-clean)
>>>>>>>>> on project hadoop-hdfs: Failed to clean project: Failed to delete
>>>>>>>>> /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3
>>>>>>>>> -> [Help 1]
>>>>>>>>>
>>>>>>>>> Any ideas how this happened? Bad disk, unit test setting wrong
>>>>>>>>> permissions?
>>>>>>>>>
>>>>>>>>> Colin
>>>>>>
>>>>>> --
>>>>>> Lei (Eddy) Xu
>>>>>> Software Engineer, Cloudera

--
Lei (Eddy) Xu
Software Engineer, Cloudera
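On Chris's point that the JUnit process may have unexpectedly died after removing executable permissions but before restoring them: a finally block cannot help once the JVM is gone, but a shutdown hook narrows the window to hard kills, since it runs on normal exit and on most abnormal exits (though not on SIGKILL). An illustrative sketch; PermissionGuard is a hypothetical helper, not something in the actual test suites:

    import java.io.File;

    import org.apache.hadoop.fs.FileUtil;

    public final class PermissionGuard {
      private PermissionGuard() {}

      /**
       * Call immediately after removing permissions from dir. The hook
       * restores permissions on JVM exit, so the Jenkins workspace stays
       * deletable even if the test run dies before its tearDown().
       */
      public static void restoreOnExit(final File dir) {
        Runtime.getRuntime().addShutdownHook(new Thread() {
          @Override
          public void run() {
            FileUtil.setWritable(dir, true);
            FileUtil.setExecutable(dir, true);
          }
        });
      }
    }

A test would call PermissionGuard.restoreOnExit(dir) right after revoking the permissions; restoring them twice is harmless.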