> Since we cannot reproduce this, and we cannot easily stop using
> repoman in OST at this point, we implemented a work-around for the
> time being where we directed the master flow to run on a fixed set of
> nodes that have A LOT of RAM [3].
Take into account that this will make the suites run significantly slower (+10 minutes), as IIRC all those servers are multi-NUMA. Also, something must really be exploding, because the basic suite does not take more than 10GB of RAM, and most of the low-memory servers have around 48GB.

> filling up with files; instead, repoman's memory usage was exploding
> (20G+) to the point where there was no more memory available for use
> by /dev/shm.

I have a wild guess that this happens because repoman does post-filtering: it first downloads all packages, then filters them.

About node and appliance, I think we should avoid downloading them; they are not used anywhere as far as I know. This filter should work (in extra_sources), last I checked:

rec:http://plain.resources.ovirt.org/repos/ovirt/tested/4.1/rpm/el7/:name~^(?!ovirt-node-ng-image|ovirt-engine-appliance).*

If it goes into the Groovy it will need some regex-escaping love... though if my previous assumption (post-filtering) is correct, it probably wouldn't matter.

This raises the question (again) of how we can filter stuff out of repoman efficiently, without hiding it in 'extra_sources'.

Nadav.

On Wed, Feb 22, 2017 at 8:07 PM, Barak Korren <[email protected]> wrote:
> Hi everyone,
>
> We've recently seen repeating errors where the OST 'master upgrade
> from release' suite failed with a repoman exception.
> Close analysis revealed that repoman was failing because it ran out of
> space in /dev/shm (OST suites are configured to run from /dev/shm if
> the slave has more than 16G available in it).
>
> The thing is, there is nothing that seems special about this suite and
> the packages it downloads, but since we suspected package sizes we
> opened OST-49 [1].
>
> Trying to get more information, we monitored a slave while it was
> running the suite.
> We found out that it wasn't the /dev/shm that was
> filling up with files; instead, repoman's memory usage was exploding
> (20G+) to the point where there was no more memory available for use
> by /dev/shm.
> As a result we reported REP-3 [2].
>
> This is not happening all the time. The same suite sometimes succeeds
> on the exact same slaves. We haven't yet managed to manually reproduce
> this.
>
> Since we cannot reproduce this, and we cannot easily stop using
> repoman in OST at this point, we implemented a work-around for the
> time being where we directed the master flow to run on a fixed set of
> nodes that have A LOT of RAM [3].
>
> Needless to say, this is not a long-term solution. We need to somehow
> manage to reproduce the problem or gain insight into it. Alternatively,
> we can consider reworking the OST suites to not use repoman for
> downloading, but still use it for local repo building (where its
> unique properties are crucial).
>
> [1]: https://ovirt-jira.atlassian.net/browse/OST-49
> [2]: https://ovirt-jira.atlassian.net/browse/REP-3
> [3]: http://jenkins.ovirt.org/label/integ-tests-big/
>
> --
> Barak Korren
> [email protected]
> RHCE, RHCi, RHV-DevOps Team
> https://ifireball.wordpress.com/
> _______________________________________________
> Infra mailing list
> [email protected]
> http://lists.ovirt.org/mailman/listinfo/infra
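To make the post-filtering guess above concrete, here is a minimal Python sketch of the suspected behaviour. The helper functions are hypothetical and do not reflect repoman's actual code or API: the point is only that filtering after fetching holds every package at once, while filtering by name first never touches the excluded ones.

```python
# Hypothetical sketch of the suspected post-filtering behaviour -- this
# is NOT repoman's actual code, just an illustration of the guess above.

def post_filter(package_names, keep):
    # Suspected behaviour: fetch everything first, then filter.
    downloaded = list(package_names)  # all packages held at once
    return [name for name in downloaded if keep(name)]

def pre_filter(package_names, keep):
    # Cheaper alternative: apply the name filter before fetching anything.
    return [name for name in package_names if keep(name)]

def keep(name):
    # Same exclusion as the suggested filter: drop node image and appliance.
    return not name.startswith(('ovirt-node-ng-image', 'ovirt-engine-appliance'))

pkgs = ['ovirt-node-ng-image-4.1', 'ovirt-engine-appliance-4.1', 'vdsm-4.19']
print(post_filter(pkgs, keep))  # ['vdsm-4.19'] -- same result, very different peak usage
```

Both paths return the same package list; the difference is only in peak resource usage while the list is being built.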
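The suggested name filter can also be sanity-checked on its own. A quick Python check of the negative-lookahead pattern follows; the package names below are made up for illustration and are not an actual repo listing.

```python
import re

# The name filter suggested above: match anything that does NOT start
# with ovirt-node-ng-image or ovirt-engine-appliance.
pattern = re.compile(r'^(?!ovirt-node-ng-image|ovirt-engine-appliance).*')

# Illustrative package names -- not an actual repo listing.
names = [
    'ovirt-engine-4.1.0-1.el7.noarch',
    'ovirt-node-ng-image-4.1.0-1.el7.noarch',
    'ovirt-engine-appliance-4.1-20170201.1.el7.noarch',
    'vdsm-4.19.4-1.el7.x86_64',
]

kept = [n for n in names if pattern.match(n)]
print(kept)  # only the node image and appliance packages are dropped
```

Note that plain `ovirt-engine-*` packages still pass, since the lookahead only rejects the exact `ovirt-engine-appliance` prefix.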
