I'm on it. HADOOP-11721

On Mon, Mar 16, 2015 at 3:44 PM, Haohui Mai <whe...@apache.org> wrote:
> +1 for git clean.
>
> Colin, can you please get it in ASAP? Currently, due to the jenkins
> issues, we cannot close the 2.7 blockers.
>
> Thanks,
> Haohui
>
> On Mon, Mar 16, 2015 at 11:54 AM, Colin P. McCabe <cmcc...@apache.org> wrote:
> > If all it takes is someone creating a test that makes a directory
> > without -x, this is going to happen over and over.
> >
> > Let's just fix the problem at the root by running "git clean -fqdx" in
> > our jenkins scripts. If there are no objections, I will add this in and
> > un-break the builds.
> >
> > best,
> > Colin
> >
> > On Fri, Mar 13, 2015 at 1:48 PM, Lei Xu <l...@cloudera.com> wrote:
> >> I filed HDFS-7917 to change the way we simulate disk failures.
> >>
> >> But I think we still need infrastructure folks to help with the jenkins
> >> scripts to clean up the dirs left behind today.
> >>
> >> On Fri, Mar 13, 2015 at 1:38 PM, Mai Haohui <ricet...@gmail.com> wrote:
> >>> Any updates on this issue? It seems that all HDFS jenkins builds are
> >>> still failing.
> >>>
> >>> Regards,
> >>> Haohui
> >>>
> >>> On Thu, Mar 12, 2015 at 12:53 AM, Vinayakumar B <vinayakum...@apache.org> wrote:
> >>>> I think the problem started from here:
> >>>>
> >>>> https://builds.apache.org/job/PreCommit-HDFS-Build/9828/testReport/junit/org.apache.hadoop.hdfs.server.datanode/TestDataNodeVolumeFailure/testUnderReplicationAfterVolFailure/
> >>>>
> >>>> As Chris mentioned, TestDataNodeVolumeFailure changes the permissions.
> >>>> But with this patch, ReplicationMonitor hit an NPE and got a terminate
> >>>> signal, due to which MiniDFSCluster.shutdown() threw an exception.
> >>>>
> >>>> And TestDataNodeVolumeFailure#tearDown() only restores those permissions
> >>>> after shutting down the cluster. So in this case, IMO, the permissions
> >>>> were never restored:
> >>>>
> >>>> @After
> >>>> public void tearDown() throws Exception {
> >>>>   if (data_fail != null) {
> >>>>     FileUtil.setWritable(data_fail, true);
> >>>>   }
> >>>>   if (failedDir != null) {
> >>>>     FileUtil.setWritable(failedDir, true);
> >>>>   }
> >>>>   if (cluster != null) {
> >>>>     cluster.shutdown();
> >>>>   }
> >>>>   for (int i = 0; i < 3; i++) {
> >>>>     FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 1)), true);
> >>>>     FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 2)), true);
> >>>>   }
> >>>> }
> >>>>
> >>>> Regards,
> >>>> Vinay
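A minimal sketch of the reordering this implies, reusing the field names from
the snippet Vinay quotes above and written as if it lived in the same test
class (illustrative only, not the patch that was actually committed for
HDFS-7722/HDFS-7917): restore the permissions first, and shut the cluster
down in a finally block, so an exception from MiniDFSCluster.shutdown() can
no longer skip the restore step.

    @After
    public void tearDown() throws Exception {
      try {
        // Restore permissions unconditionally, before anything that can throw,
        // so the data dirs are never left without the executable bit.
        if (data_fail != null) {
          FileUtil.setWritable(data_fail, true);
        }
        if (failedDir != null) {
          FileUtil.setWritable(failedDir, true);
        }
        for (int i = 0; i < 3; i++) {
          FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 1)), true);
          FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 2)), true);
        }
      } finally {
        // Shut down last; if this throws, the permissions are already back.
        if (cluster != null) {
          cluster.shutdown();
        }
      }
    }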
> >>>> On Thu, Mar 12, 2015 at 12:35 PM, Vinayakumar B <vinayakum...@apache.org> wrote:
> >>>>> When I look at the history of these kinds of builds, all of them
> >>>>> failed on node H9.
> >>>>>
> >>>>> I think some uncommitted patch or other created the problem and left
> >>>>> it there.
> >>>>>
> >>>>> Regards,
> >>>>> Vinay
> >>>>>
> >>>>> On Thu, Mar 12, 2015 at 6:16 AM, Sean Busbey <bus...@cloudera.com> wrote:
> >>>>>> You could rely on a destructive git clean call instead of maven to do
> >>>>>> the directory removal.
> >>>>>>
> >>>>>> --
> >>>>>> Sean
> >>>>>>
> >>>>>> On Mar 11, 2015 4:11 PM, "Colin McCabe" <cmcc...@alumni.cmu.edu> wrote:
> >>>>>> > Is there a maven plugin or setting we can use to simply remove
> >>>>>> > directories that have no executable permissions on them? Clearly we
> >>>>>> > have the permission to do this from a technical point of view (since
> >>>>>> > we created the directories as the jenkins user); it's simply that the
> >>>>>> > code refuses to do it.
> >>>>>> >
> >>>>>> > Otherwise I guess we can just fix those tests...
> >>>>>> >
> >>>>>> > Colin
> >>>>>> >
> >>>>>> > On Tue, Mar 10, 2015 at 2:43 PM, Lei Xu <l...@cloudera.com> wrote:
> >>>>>> > > Thanks a lot for looking into HDFS-7722, Chris.
> >>>>>> > >
> >>>>>> > > In HDFS-7722:
> >>>>>> > > The TestDataNodeVolumeFailureXXX tests reset the data dir permissions in TearDown().
> >>>>>> > > TestDataNodeHotSwapVolumes resets the permissions in a finally clause.
> >>>>>> > >
> >>>>>> > > Also, I ran mvn test several times on my machine and all tests passed.
> >>>>>> > >
> >>>>>> > > However, since DiskChecker#checkDirAccess() does
> >>>>>> > >
> >>>>>> > > private static void checkDirAccess(File dir) throws DiskErrorException {
> >>>>>> > >   if (!dir.isDirectory()) {
> >>>>>> > >     throw new DiskErrorException("Not a directory: "
> >>>>>> > >                                  + dir.toString());
> >>>>>> > >   }
> >>>>>> > >
> >>>>>> > >   checkAccessByFileMethods(dir);
> >>>>>> > > }
> >>>>>> > >
> >>>>>> > > one potentially safer alternative is replacing the data dir with a
> >>>>>> > > regular file to simulate disk failures.
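A rough sketch of that alternative, with hypothetical helper names
(illustrative only, not the actual HDFS-7917 patch): if the data dir is
replaced by an ordinary file, checkDirAccess() fails with "Not a directory"
and no permission bits are ever changed, so there is nothing left behind for
the build's clean step to trip over.

    import java.io.File;
    import java.io.IOException;
    import org.apache.hadoop.fs.FileUtil;

    class DiskFailureSimulation {
      /** Replace a DataNode data dir with a plain file; DiskChecker then
       *  reports "Not a directory" while all permissions stay intact. */
      static void simulateDiskFailure(File dataDir) throws IOException {
        FileUtil.fullyDelete(dataDir);        // remove the real directory tree
        if (!dataDir.createNewFile()) {
          throw new IOException("Could not create placeholder " + dataDir);
        }
      }

      /** Undo the simulated failure so normal cleanup (mvn clean, git clean)
       *  keeps working afterwards. */
      static void restoreDataDir(File dataDir) throws IOException {
        if (!dataDir.delete() || !dataDir.mkdirs()) {
          throw new IOException("Could not restore " + dataDir);
        }
      }
    }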
> >>>>>> > >
> >>>>>> > > On Tue, Mar 10, 2015 at 2:19 PM, Chris Nauroth <cnaur...@hortonworks.com> wrote:
> >>>>>> > >> TestDataNodeHotSwapVolumes, TestDataNodeVolumeFailure,
> >>>>>> > >> TestDataNodeVolumeFailureReporting, and
> >>>>>> > >> TestDataNodeVolumeFailureToleration all remove executable permissions
> >>>>>> > >> from directories like the one Colin mentioned to simulate disk
> >>>>>> > >> failures at data nodes. I reviewed the code for all of those, and
> >>>>>> > >> they all appear to be doing the necessary work to restore executable
> >>>>>> > >> permissions at the end of the test. The only recent uncommitted patch
> >>>>>> > >> I've seen that makes changes in these test suites is HDFS-7722. That
> >>>>>> > >> patch still looks fine, though. I don't know if there are other
> >>>>>> > >> uncommitted patches that changed these test suites.
> >>>>>> > >>
> >>>>>> > >> I suppose it's also possible that the JUnit process unexpectedly died
> >>>>>> > >> after removing executable permissions but before restoring them. That
> >>>>>> > >> would always have been a weakness of these test suites, regardless of
> >>>>>> > >> any recent changes.
> >>>>>> > >>
> >>>>>> > >> Chris Nauroth
> >>>>>> > >> Hortonworks
> >>>>>> > >> http://hortonworks.com/
> >>>>>> > >>
> >>>>>> > >> On 3/10/15, 1:47 PM, "Aaron T. Myers" <a...@cloudera.com> wrote:
> >>>>>> > >>> Hey Colin,
> >>>>>> > >>>
> >>>>>> > >>> I asked Andrew Bayer, who works with Apache Infra, what's going on
> >>>>>> > >>> with these boxes. He took a look and concluded that some perms are
> >>>>>> > >>> being set in those directories by our unit tests which are precluding
> >>>>>> > >>> those files from getting deleted. He's going to clean up the boxes for
> >>>>>> > >>> us, but we should expect this to keep happening until we can fix the
> >>>>>> > >>> test in question to properly clean up after itself.
> >>>>>> > >>>
> >>>>>> > >>> To help narrow down which commit it was that started this, Andrew sent
> >>>>>> > >>> me this info:
> >>>>>> > >>>
> >>>>>> > >>> "/home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3/
> >>>>>> > >>> has 500 perms, so I'm guessing that's the problem. Been that way since
> >>>>>> > >>> 9:32 UTC on March 5th."
> >>>>>> > >>>
> >>>>>> > >>> --
> >>>>>> > >>> Aaron T. Myers
> >>>>>> > >>> Software Engineer, Cloudera
> >>>>>> > >>>
> >>>>>> > >>> On Tue, Mar 10, 2015 at 1:24 PM, Colin P. McCabe <cmcc...@apache.org> wrote:
> >>>>>> > >>>> Hi all,
> >>>>>> > >>>>
> >>>>>> > >>>> A very quick (and not thorough) survey shows that I can't find any
> >>>>>> > >>>> jenkins jobs that succeeded in the last 24 hours. Most of them seem
> >>>>>> > >>>> to be failing with some variant of this message:
> >>>>>> > >>>>
> >>>>>> > >>>> [ERROR] Failed to execute goal
> >>>>>> > >>>> org.apache.maven.plugins:maven-clean-plugin:2.5:clean (default-clean)
> >>>>>> > >>>> on project hadoop-hdfs: Failed to clean project: Failed to delete
> >>>>>> > >>>> /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3
> >>>>>> > >>>> -> [Help 1]
> >>>>>> > >>>>
> >>>>>> > >>>> Any ideas how this happened? Bad disk, unit test setting wrong
> >>>>>> > >>>> permissions?
> >>>>>> > >>>>
> >>>>>> > >>>> Colin
> >>>>>> > >
> >>>>>> > > --
> >>>>>> > > Lei (Eddy) Xu
> >>>>>> > > Software Engineer, Cloudera
> >>
> >> --
> >> Lei (Eddy) Xu
> >> Software Engineer, Cloudera
>
--
Sean
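For completeness, the kind of forced directory removal Colin asks about
earlier in the thread ("remove directories that have no executable permissions
on them") could be approximated in plain Java roughly as below. This is a
hypothetical sketch, not an existing maven plugin or Hadoop utility; the
thread instead settled on running "git clean -fqdx" in the jenkins scripts
(HADOOP-11721).

    import java.io.File;
    import java.io.IOException;

    class ForceDelete {
      /** Restore owner rwx on each directory before recursing into it, so that
       *  directories a test left without the executable bit can still be removed. */
      static void forceDelete(File f) throws IOException {
        if (f.isDirectory()) {
          // chmod only requires ownership (which the jenkins user has), even
          // when r/w/x were stripped by a test.
          f.setReadable(true);
          f.setWritable(true);
          f.setExecutable(true);
          File[] children = f.listFiles();
          if (children != null) {
            for (File child : children) {
              forceDelete(child);
            }
          }
        }
        if (!f.delete() && f.exists()) {
          throw new IOException("Failed to delete " + f);
        }
      }
    }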