Hi, there was another nightly run that blew up in HdfsCollectionsAPIDistributedZkTest. It ran out of disk space and then hung.
I killed the job and nuked the workspace to free disk.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Uwe Schindler [mailto:[email protected]]
> Sent: Friday, August 19, 2016 9:50 AM
> To: [email protected]
> Cc: Dawid Weiss <[email protected]>
> Subject: RE: Lucene/Solr Nightly Tests Master Build #1100 stuck?
>
> Hi,
>
> I talked with Steve via IRC yesterday.
>
> > This job finally finished.
> >
> > Looks like the full disk problem was triggered when writing logs out - the
> > following is recorded in consoleText 1,478 times:
> >
> > -----
> > [junit4] java.io.IOException: No space left on device
> > [junit4]   at java.io.RandomAccessFile.writeBytes(Native Method)
> > [junit4]   at java.io.RandomAccessFile.write(RandomAccessFile.java:525)
> > [junit4]   at com.carrotsearch.ant.tasks.junit4.LocalSlaveStreamHandler$1.write(LocalSlaveStreamHandler.java:74)
> > [junit4]   at com.carrotsearch.ant.tasks.junit4.events.AppendStdErrEvent.copyTo(AppendStdErrEvent.java:24)
> > [junit4]   at com.carrotsearch.ant.tasks.junit4.LocalSlaveStreamHandler.pumpEvents(LocalSlaveStreamHandler.java:252)
> > [junit4]   at com.carrotsearch.ant.tasks.junit4.LocalSlaveStreamHandler$2.run(LocalSlaveStreamHandler.java:122)
> > -----
> >
> > I ssh'd into lucene1-us-west.apache.org, where the lucene Jenkins slave is
> > hosted, to look at the disk space situation.
> >
> > -----
> > jenkins@lucene1-us-west:~$ df -k .
> > Filesystem      1K-blocks     Used Available Use% Mounted on
> > /dev/sdb1       139204584 90449280  42237900  69% /x1
> > jenkins@lucene1-us-west:~$ df -k
> > Filesystem                              1K-blocks     Used Available Use% Mounted on
> > /dev/mapper/lucene1--us--west--vg-root   30582652 23554540   5451564  82% /
> > [...]
> > /dev/sdb1                               139204584 90449280  42237900  69% /x1
> > -----
> >
> > All Jenkins workspaces are under /x1/jenkins/.
> >
> > Separately (I think), I see that Uwe has the enwiki.random.lines.txt file
> > checked out multiple times (looks like once per job, of which there are
> > currently 17, though I doubt all of them need this file), so each copy is
> > taking up 3GB:
>
> That's not true. The enwiki file is only part of 2 jobs, which actually do
> the checkout as a separate Jenkins task. Those jobs also run without the
> security manager, so they can access the file outside the project dir.
>
> This may be the cause of the issues, too. It looks like some tests in Solr
> do not like it when the security manager is switched off
> (-Dtests.useSecurityManager=false). IMHO, I'd suggest randomly disabling the
> security manager on the Policeman builds as well (I know Elasticsearch did
> this in the past, too, before they made the whole server use the Security
> Manager), so we also get better test coverage under real conditions. This
> would uncover such bugs under "normal" test runs, too.
>
> > -----
> > jenkins@lucene1-us-west:~/jenkins-slave$ ls -l workspace/*/test-data
> > workspace/Lucene-Solr-NightlyTests-6.x/test-data:
> > total 2966980
> > -rw-r--r-- 1 jenkins jenkins 3038178822 Aug 16 03:18 enwiki.random.lines.txt
> > -rw-r--r-- 1 jenkins jenkins        452 Aug 16 03:18 README.txt
> >
> > workspace/Lucene-Solr-NightlyTests-master/test-data:
> > total 2966980
> > -rw-r--r-- 1 jenkins jenkins 3038178822 Aug 15 22:27 enwiki.random.lines.txt
> > -rw-r--r-- 1 jenkins jenkins        452 Aug 15 22:27 README.txt
> > -----
> >
> > Uwe, is there any way we can just have one copy shared by all jobs?
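> > Just a rough, untested sketch of what I have in mind (it assumes the jobs
> > can read and follow a symlink to a path outside their own workspace):
> >
> > -----
> > # keep a single shared copy outside the per-job workspaces
> > mkdir -p /x1/jenkins/shared-test-data
> > mv /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-master/test-data/enwiki.random.lines.txt \
> >    /x1/jenkins/shared-test-data/
> > # replace each job's private copy with a symlink to the shared one
> > for ws in /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-*; do
> >   ln -sf /x1/jenkins/shared-test-data/enwiki.random.lines.txt \
> >     "$ws/test-data/enwiki.random.lines.txt"
> > done
> > -----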
>
> I think that's not needed.
>
> > Here are the disk footprints by job:
> >
> > -----
> > jenkins@lucene1-us-west:~/jenkins-slave/workspace$ du -sh /x1/jenkins/jenkins-slave/workspace/*
> > 28K   /x1/jenkins/jenkins-slave/workspace/infra-test-ant-ubuntu
> > 44K   /x1/jenkins/jenkins-slave/workspace/infra-test-maven-ubuntu
> > 971M  /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-6.x
> > 968M  /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-master
> > 368M  /x1/jenkins/jenkins-slave/workspace/Lucene-Ivy-Bootstrap
> > 6.5G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Clover-master
> > 1.7G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Maven-6.x
> > 1.7G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Maven-master
> > 6.5G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-6.x
> > 56G   /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-master
> > 2.0G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-SmokeRelease-6.x
> > 2.0G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-SmokeRelease-master
> > 1.2G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-6.x
> > 1.6G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master
> > 468M  /x1/jenkins/jenkins-slave/workspace/Lucene-Tests-MMAP-master
> > 1.7G  /x1/jenkins/jenkins-slave/workspace/Solr-Artifacts-6.x
> > 1.7G  /x1/jenkins/jenkins-slave/workspace/Solr-Artifacts-master
> > -----
> >
> > Turns out there is a single *45GB* file in the job with the largest disk
> > footprint (also the job that started this thread) - under
> > /home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-master/:
> >
> > solr/build/solr-core/test/temp/junit4-J2-20160817_095505_593.events
> >
> > Does anybody know if we can limit the size of these *.events files, which
> > seem to be created under OOM conditions?
> >
> > I ran 'rm -rf solr/build' to reclaim the disk space.
>
> Thanks. I'd like to get a handle on the issues with the nightly tests. As
> far as I can see, all builds failed with OOM conditions. The problem is that
> the builds create heap dumps for debugging (mainly for issues in Lucene),
> and those are filling the disk, too.
>
> If we cannot fix the OOM issues and the steadily growing events files they
> cause, we should temporarily disable the nightly jobs and nuke their
> workspaces.
>
> Uwe
>
> > On Aug 17, 2016, at 5:49 PM, Kevin Risden <[email protected]> wrote:
> > >
> > > Usually the build takes 5-6 hours, and now it's been ~14 hrs.
> > >
> > > https://builds.apache.org/job/Lucene-Solr-NightlyTests-master/1100
> > >
> > > I saw this in the console logs:
> > >
> > > java.security.PrivilegedActionException: java.io.IOException: No space left on device
> > >
> > > Looks like it might be stuck here:
> > >
> > > Archiving artifacts
> > >
> > > Not sure if there is something that can be done about this?
> > >
> > > Kevin Risden
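PS: For next time, before nuking a whole workspace, something like this should
find the big offenders first (just a sketch, untested; the workspace path and
the junit4-*.events name come from the listings above, and it assumes the heap
dumps are ordinary *.hprof files):

-----
# list junit4 event logs and heap dumps larger than 1GB under the Jenkins workspaces
find /x1/jenkins/jenkins-slave/workspace -type f \
  \( -name "junit4-*.events" -o -name "*.hprof" \) -size +1G -exec ls -lh {} \;
# after reviewing the list, re-run with -delete (or rm the files) to reclaim the space
-----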
