Hi,

I talked with Steve via IRC yesterday.

> This job finally finished.
> 
> Looks like the full disk problem was triggered when writing logs out - the
> following is recorded in consoleText 1,478 times:
> 
> -----
>   [junit4] java.io.IOException: No space left on device
>    [junit4]     at java.io.RandomAccessFile.writeBytes(Native Method)
>    [junit4]     at java.io.RandomAccessFile.write(RandomAccessFile.java:525)
>    [junit4]     at com.carrotsearch.ant.tasks.junit4.LocalSlaveStreamHandler$1.write(LocalSlaveStreamHandler.java:74)
>    [junit4]     at com.carrotsearch.ant.tasks.junit4.events.AppendStdErrEvent.copyTo(AppendStdErrEvent.java:24)
>    [junit4]     at com.carrotsearch.ant.tasks.junit4.LocalSlaveStreamHandler.pumpEvents(LocalSlaveStreamHandler.java:252)
>    [junit4]     at com.carrotsearch.ant.tasks.junit4.LocalSlaveStreamHandler$2.run(LocalSlaveStreamHandler.java:122)
> -----
> 
> I ssh’d into lucene1-us-west.apache.org, where the lucene Jenkins slave is
> hosted, to look at the disk space situation.
> 
> -----
> jenkins@lucene1-us-west:~$ df -k .
> Filesystem     1K-blocks     Used Available Use% Mounted on
> /dev/sdb1      139204584 90449280  42237900  69% /x1
> jenkins@lucene1-us-west:~$ df -k
> Filesystem                             1K-blocks     Used Available Use% Mounted on
> /dev/mapper/lucene1--us--west--vg-root  30582652 23554540   5451564  82% /
> […]
> /dev/sdb1                              139204584 90449280  42237900  69% /x1
> -----
> 
> All Jenkins workspaces are under /x1/jenkins/.
> 
> Separately (I think) I see that Uwe has got the enwiki.random.lines.txt file
> checked out multiple times (looks like once per job, of which there are
> currently 17, though I doubt all of them will need this file), so each copy is
> taking up 3GB:

That's not true. The enwiki file is only part of 2 jobs, which check it out 
as a separate Jenkins task. Those jobs also run without the security manager 
so they can access the file outside the project directory.

This may also be the cause of the issues. It looks like some tests in Solr do 
not like it when the security manager is switched off 
(-Dtests.useSecurityManager=false). IMHO, we should randomly disable the 
security manager on Policeman builds as well (I know Elasticsearch did this in 
the past, too, before they made the whole server use the security manager), so 
we also test under real-world conditions. This would uncover such bugs in 
"normal" test runs, too.
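As a rough sketch of what I mean (not the actual Policeman job configuration; 
only the -Dtests.useSecurityManager property comes from the test framework, 
everything else is assumed), a small shell step in front of the Ant call could 
do the coin flip:

-----
# Hypothetical Jenkins "Execute shell" step, not the real job definition.
# Randomly run the test suite with or without the security manager.
if [ $((RANDOM % 2)) -eq 0 ]; then
  USE_SM=true
else
  USE_SM=false
fi
echo "Running tests with -Dtests.useSecurityManager=$USE_SM"
ant -Dtests.useSecurityManager=$USE_SM test
-----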

> -----
> jenkins@lucene1-us-west:~/jenkins-slave$ ls -l workspace/*/test-data
> workspace/Lucene-Solr-NightlyTests-6.x/test-data:
> total 2966980
> -rw-r--r-- 1 jenkins jenkins 3038178822 Aug 16 03:18 enwiki.random.lines.txt
> -rw-r--r-- 1 jenkins jenkins        452 Aug 16 03:18 README.txt
> 
> workspace/Lucene-Solr-NightlyTests-master/test-data:
> total 2966980
> -rw-r--r-- 1 jenkins jenkins 3038178822 Aug 15 22:27 enwiki.random.lines.txt
> -rw-r--r-- 1 jenkins jenkins        452 Aug 15 22:27 README.txt
> -----
> 
> Uwe, is there any way we can just have one copy shared by all jobs?

I don't think that's needed.

> Here are the disk footprints by job:
> 
> -----
> jenkins@lucene1-us-west:~/jenkins-slave/workspace$ du -sh /x1/jenkins/jenkins-slave/workspace/*
> 28K   /x1/jenkins/jenkins-slave/workspace/infra-test-ant-ubuntu
> 44K   /x1/jenkins/jenkins-slave/workspace/infra-test-maven-ubuntu
> 971M  /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-6.x
> 968M  /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-master
> 368M  /x1/jenkins/jenkins-slave/workspace/Lucene-Ivy-Bootstrap
> 6.5G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Clover-master
> 1.7G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Maven-6.x
> 1.7G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Maven-master
> 6.5G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-6.x
> 56G   /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-master
> 2.0G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-SmokeRelease-6.x
> 2.0G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-SmokeRelease-master
> 1.2G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-6.x
> 1.6G  /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master
> 468M  /x1/jenkins/jenkins-slave/workspace/Lucene-Tests-MMAP-master
> 1.7G  /x1/jenkins/jenkins-slave/workspace/Solr-Artifacts-6.x
> 1.7G  /x1/jenkins/jenkins-slave/workspace/Solr-Artifacts-master
> -----
> 
> Turns out there is a single *45GB* file in the job with the largest disk
> footprint (also the job that started this thread) - under
> /home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-master/:
> 
>   solr/build/solr-core/test/temp/junit4-J2-20160817_095505_593.events
> 
> Does anybody know if we can limit the size of these *.events files, which
> seem to be created under OOM conditions?
> 
> I ran ‘rm -rf solr/build’ to reclaim the disk space.

Thanks. I'd like to get to the bottom of the issues with the nightly tests. As 
far as I can see, all builds failed with OOM conditions. The problem is that 
they create heap dumps for debugging (mainly for issues in Lucene), and those 
are filling the disk, too.

If we cannot fix the OOM issues and the steadily growing events files they 
cause, we should temporarily disable the nightly jobs and nuke their 
workspaces.
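Until that decision is made, a periodic cleanup step along these lines could 
keep the disk from filling up (just a sketch: the paths and the junit4 
*.events name come from the listings above, while the *.hprof pattern for 
heap dumps and the one-day retention are my assumptions):

-----
# Hypothetical cleanup for the nightly job workspaces.
# Deletes heap dumps and junit4 events files older than one day.
find /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-* \
     \( -name '*.hprof' -o -name 'junit4-*.events' \) \
     -type f -mtime +1 -print -delete
-----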

Uwe

> > On Aug 17, 2016, at 5:49 PM, Kevin Risden <[email protected]> wrote:
> >
> > Usually the build takes 5-6 hours and now it's been ~14hrs.
> >
> > https://builds.apache.org/job/Lucene-Solr-NightlyTests-master/1100
> >
> > I saw in the console logs:
> >
> > java.security.PrivilegedActionException: java.io.IOException: No space left on device
> >
> > Looks like it might be stuck here:
> >
> > Archiving artifacts
> >
> > Not sure if there is something that can be done about this?
> >
> > Kevin Risden
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
