Hi,

there was another nightly run that blew up in
HdfsCollectionsAPIDistributedZkTest.
It ran out of disk space and then hung.

I killed the job and nuked the workspace to free disk.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Uwe Schindler [mailto:[email protected]]
> Sent: Friday, August 19, 2016 9:50 AM
> To: [email protected]
> Cc: Dawid Weiss <[email protected]>
> Subject: RE: Lucene/Solr Nightly Tests Master Build #1100 stuck?
> 
> Hi,
> 
> I talked with Steve via IRC yesterday.
> 
> > This job finally finished.
> >
> > Looks like the full disk problem was triggered when writing logs out - the
> > following is recorded in consoleText 1,478 times:
> >
> > -----
> >   [junit4] java.io.IOException: No space left on device
> >    [junit4]     at java.io.RandomAccessFile.writeBytes(Native Method)
> >    [junit4]     at java.io.RandomAccessFile.write(RandomAccessFile.java:525)
> >    [junit4]     at com.carrotsearch.ant.tasks.junit4.LocalSlaveStreamHandler$1.write(LocalSlaveStreamHandler.java:74)
> >    [junit4]     at com.carrotsearch.ant.tasks.junit4.events.AppendStdErrEvent.copyTo(AppendStdErrEvent.java:24)
> >    [junit4]     at com.carrotsearch.ant.tasks.junit4.LocalSlaveStreamHandler.pumpEvents(LocalSlaveStreamHandler.java:252)
> >    [junit4]     at com.carrotsearch.ant.tasks.junit4.LocalSlaveStreamHandler$2.run(LocalSlaveStreamHandler.java:122)
> > -----
> >
> > I ssh’d into lucene1-us-west.apache.org, where the lucene Jenkins slave is
> > hosted, to look at the disk space situation.
> >
> > -----
> > jenkins@lucene1-us-west:~$ df -k .
> > Filesystem     1K-blocks     Used Available Use% Mounted on
> > /dev/sdb1      139204584 90449280  42237900  69% /x1
> > jenkins@lucene1-us-west:~$ df -k
> > Filesystem                              1K-blocks     Used Available Use% Mounted on
> > /dev/mapper/lucene1--us--west--vg-root   30582652 23554540   5451564  82% /
> > […]
> > /dev/sdb1                               139204584 90449280  42237900  69% /x1
> > -----
> >
> > All Jenkins workspaces are under /x1/jenkins/.
> >
> > Separately (I think) I see that Uwe has got the enwiki.random.lines.txt file
> > checked out multiple times (looks like once per job, of which there are
> > currently 17, though I doubt all of them will need this file); each copy
> > takes up 3GB:
> 
> That's not true. The enwiki file is only part of 2 jobs, which do the
> checkout as a separate Jenkins task. Those jobs also run without the security
> manager so they can access the file outside the project dir.
> 
> This may also be the cause of the issues. It looks like some tests in Solr
> do not like it when the security manager is switched off
> (-Dtests.useSecurityManager=false). IMHO, I'd suggest randomly disabling the
> security manager on Policeman builds as well (I know Elasticsearch did this
> in the past, too, before they made the whole server use the Security
> Manager), so we also test under real-world conditions. This would uncover
> such bugs under "normal" test runs, too.
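A minimal sketch of that randomization, assuming a Jenkins job shell step that wraps the usual ant invocation (the ant command line here is illustrative; only the -Dtests.useSecurityManager property is taken from the mail above):

```shell
# Randomly pick the security manager setting for this run.
# $RANDOM is bash-specific; each run gets true or false with ~50% probability.
SECMGR=$([ $((RANDOM % 2)) -eq 0 ] && echo true || echo false)
echo "Randomized: -Dtests.useSecurityManager=$SECMGR"
# Hypothetical ant invocation for the test job:
# ant test -Dtests.useSecurityManager=$SECMGR
```

The test seed already printed by the test framework would not capture this, so echoing the chosen value keeps failed runs reproducible.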
> 
> > -----
> > jenkins@lucene1-us-west:~/jenkins-slave$ ls -l workspace/*/test-data
> > workspace/Lucene-Solr-NightlyTests-6.x/test-data:
> > total 2966980
> > -rw-r--r-- 1 jenkins jenkins 3038178822 Aug 16 03:18 enwiki.random.lines.txt
> > -rw-r--r-- 1 jenkins jenkins        452 Aug 16 03:18 README.txt
> >
> > workspace/Lucene-Solr-NightlyTests-master/test-data:
> > total 2966980
> > -rw-r--r-- 1 jenkins jenkins 3038178822 Aug 15 22:27 enwiki.random.lines.txt
> > -rw-r--r-- 1 jenkins jenkins        452 Aug 15 22:27 README.txt
> > -----
> >
> > Uwe, is there any way we can just have one copy shared by all jobs?
> 
> I think that's not needed.
> 
> > Here are the disk footprints by job:
> >
> > -----
> > jenkins@lucene1-us-west:~/jenkins-slave/workspace$ du -sh
> > /x1/jenkins/jenkins-slave/workspace/*
> > 28K /x1/jenkins/jenkins-slave/workspace/infra-test-ant-ubuntu
> > 44K /x1/jenkins/jenkins-slave/workspace/infra-test-maven-ubuntu
> > 971M        /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-6.x
> > 968M        /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-master
> > 368M        /x1/jenkins/jenkins-slave/workspace/Lucene-Ivy-Bootstrap
> > 6.5G        /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Clover-master
> > 1.7G        /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Maven-6.x
> > 1.7G        /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Maven-master
> > 6.5G        /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-6.x
> > 56G /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-master
> > 2.0G        /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-SmokeRelease-6.x
> > 2.0G        /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-SmokeRelease-master
> > 1.2G        /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-6.x
> > 1.6G        /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master
> > 468M        /x1/jenkins/jenkins-slave/workspace/Lucene-Tests-MMAP-master
> > 1.7G        /x1/jenkins/jenkins-slave/workspace/Solr-Artifacts-6.x
> > 1.7G        /x1/jenkins/jenkins-slave/workspace/Solr-Artifacts-master
> > -----
> >
> > Turns out there is a single *45GB* file in the job with the largest disk
> > footprint (also the job that started this thread) - under
> > /home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-master/:
> >
> >   solr/build/solr-core/test/temp/junit4-J2-20160817_095505_593.events
> >
> > Does anybody know if we can limit the size of these *.events files, which
> > seem to be created under OOM conditions?
> >
> > I ran ‘rm -rf solr/build’ to reclaim the disk space.
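Before resorting to a blanket `rm -rf`, a more targeted sketch for spotting the culprits (the workspace path mirrors the one shown above; the 1GB threshold is an arbitrary choice, not anything the test framework defines):

```shell
# List any junit4 event-stream files over 1GB across all job workspaces,
# with human-readable sizes, so the largest offenders can be removed first.
find /x1/jenkins/jenkins-slave/workspace -type f \
     -name 'junit4-*.events' -size +1G -exec ls -lh {} \;
```

The same pattern with `-name '*.hprof'` would catch the heap dumps mentioned below as well.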
> 
> Thanks. I'd like to get a handle on the issues with the nightly tests. As
> far as I can see, all builds failed with OOM conditions. The problem is that
> heap dumps are created for debugging (mainly for issues in Lucene), and
> those fill the disk, too.
> 
> If we cannot fix the OOM issues and the resulting steadily growing events
> files, we should temporarily disable the nightly jobs and nuke their
> workspaces.
> 
> Uwe
> 
> > > On Aug 17, 2016, at 5:49 PM, Kevin Risden <[email protected]> wrote:
> > >
> > > Usually the build takes 5-6 hours and now it's been ~14hrs.
> > >
> > > https://builds.apache.org/job/Lucene-Solr-NightlyTests-master/1100
> > >
> > > I saw in the console logs:
> > >
> > > java.security.PrivilegedActionException: java.io.IOException: No space left on device
> > >
> > > Looks like it might be stuck here:
> > >
> > > Archiving artifacts
> > >
> > > Not sure if there is something that can be done about this?
> > >
> > > Kevin Risden
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> 
> 

