On Tue, Mar 20, 2018 at 11:24 AM, Gal Ben Haim <[email protected]> wrote:
> The failure happened again on "ovirt-srv04".
> The suite wasn't run from "/dev/shm" since it was full of stale lago
> environments of "hc-basic-suite-4.1" and "he-basic-iscsi-suite-4.2".
> The reason for the stale envs is a timeout that was raised by Jenkins
> (the suites were stuck for 6 hours), so OST's cleanup was not called.
> I'm going to add an internal timeout to OST.
>
>
> On Tue, Mar 20, 2018 at 11:03 AM, Yedidyah Bar David <[email protected]>
> wrote:
>>
>> On Tue, Mar 20, 2018 at 10:57 AM, Barak Korren <[email protected]> wrote:
>> > On 20 March 2018 at 10:53, Yedidyah Bar David <[email protected]> wrote:
>> >> On Tue, Mar 20, 2018 at 10:11 AM, Barak Korren <[email protected]>
>> >> wrote:
>> >>> On 20 March 2018 at 09:17, Yedidyah Bar David <[email protected]> wrote:
>> >>>> On Mon, Mar 19, 2018 at 6:56 PM, Dominik Holler <[email protected]>
>> >>>> wrote:
>> >>>>> Thanks Gal, I expect the problem is fixed, until something again
>> >>>>> eats all the space in /dev/shm.
>> >>>>> But the usage of /dev/shm is logged in the output, so we would be
>> >>>>> able to detect the problem instantly next time.
>> >>>>>
>> >>>>> From my point of view it would be good to know why /dev/shm was
>> >>>>> full, to prevent this situation in the future.
>> >>>>
>> >>>> Gal already wrote below - it was because some build failed to clean
>> >>>> up after itself.
>> >>>>
>> >>>> I don't know about this specific case, but I was told that I am
>> >>>> personally causing such issues by using the 'cancel' button, so I
>> >>>> sadly stopped. Sadly, because our CI system is quite loaded, and
>> >>>> when I know that some build is useless, I wish to kill it and save
>> >>>> some load...
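[The internal timeout Gal mentions could look roughly like the sketch below: cap the suite's runtime below Jenkins' 6-hour job timeout so OST's own cleanup still gets a chance to run. The wrapper function, the 1-second limit, and the use of `sleep` as a stand-in for the real suite are all illustrative assumptions, not code from OST.]

```shell
#!/bin/bash
# Hypothetical internal-timeout sketch: run the suite under coreutils
# 'timeout' with a limit shorter than Jenkins' 6h, so cleanup always runs.
# 'sleep 2' stands in for the real suite; all values are illustrative.
run_with_timeout() {
    # timeout exits with 124 when the command is killed for running too long;
    # --kill-after sends SIGKILL if the command ignores the initial SIGTERM.
    timeout --kill-after=5 "$1" "${@:2}"
}

rc=0
run_with_timeout 1 sleep 2 || rc=$?
echo "suite exit status: $rc"        # 124 means the internal timeout fired
echo "running OST cleanup anyway"    # cleanup runs regardless of rc
```

The point of the wrapper is that the timeout fires *inside* the job, so the script keeps control and can still remove its lago environment from /dev/shm, instead of Jenkins killing the whole process tree from outside.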
>> >>>>
>> >>>> Back to your point, perhaps we should make jobs check /dev/shm when
>> >>>> they _start_, and either alert/fail/whatever if it's not almost
>> >>>> free, or, if we know what we are doing, just remove stuff there.
>> >>>> That might be much easier than fixing things to clean up at the
>> >>>> end, and/or debugging why that cleanup failed.
>> >>>
>> >>> Sure thing, patches to:
>> >>>
>> >>> [jenkins repo]/jobs/confs/shell-scripts/cleanup_slave.sh
>> >>>
>> >>> are welcome, we often find interesting stuff to add there...
>> >>>
>> >>> If constrained for time, please turn this comment into an orderly
>> >>> RFE in Jira...
>> >>
>> >> I searched for '/dev/shm' and found way too many places to analyze
>> >> them all and add something to cleanup_slave to cover everything.
>> >
>> > Where did you search?
>>
>> ovirt-system-tests, lago, lago-ost-plugin.
>> ovirt-system-tests has 83 occurrences. I realize almost all are in
>> lago guests, but looking still takes time...
>>
>> In theory I can patch cleanup_slave.sh as you suggested, removing
>> _everything_ there.
>> Not sure this is safe.
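[The start-of-job check proposed above could be sketched roughly like this. The 20% threshold, the function name, and the report-only behavior are assumptions for illustration; they are not taken from the jenkins repo.]

```shell
#!/bin/bash
# Hypothetical pre-job guard: warn early if /dev/shm is already mostly
# used (e.g. by stale lago environments). A real guard in a job wrapper
# would probably 'exit 1' instead of only printing a warning.

# Print how full the given mount point is, as a bare percentage.
used_pct() {
    df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

pct=$(used_pct /dev/shm)
if [ "$pct" -gt 20 ]; then
    echo "WARNING: /dev/shm is ${pct}% used - possible stale lago envs:" >&2
    ls -l /dev/shm >&2
else
    echo "/dev/shm looks clean (${pct}% used)"
fi
```

Checking at start, as suggested, turns a mysterious mid-run failure ("suite wasn't run from /dev/shm") into an immediate, explicit report of which leftovers are occupying the space.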
Well, pushed now: https://gerrit.ovirt.org/89225

>> >>
>> >> Pushed this for now:
>> >>
>> >> https://gerrit.ovirt.org/89215
>> >>
>> >>> --
>> >>> Barak Korren
>> >>> RHV DevOps team, RHCE, RHCi
>> >>> Red Hat EMEA
>> >>> redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
>>
>> --
>> Didi
>> _______________________________________________
>> Infra mailing list
>> [email protected]
>> http://lists.ovirt.org/mailman/listinfo/infra
>
> --
> GAL bEN HAIM
> RHV DEVOPS

--
Didi
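[A cleanup_slave.sh-style sweep of stale /dev/shm entries, in the spirit of the patch mentioned above, could look roughly like the sketch below. The ~24h age cutoff, the DO_REMOVE switch, and the dry-run default are illustrative assumptions, not the contents of the actual gerrit change, and the "not sure this is safe" caveat from the thread applies in full.]

```shell
#!/bin/bash
# Hypothetical cleanup sketch: list (and optionally remove) anything in
# /dev/shm older than ~24h, i.e. likely leftovers from a killed job.
# Dry-run by default; set DO_REMOVE=1 to actually delete.
SHM_DIR="${SHM_DIR:-/dev/shm}"

# -mmin +1440: entries not modified in the last 1440 minutes (24 hours).
find "$SHM_DIR" -mindepth 1 -maxdepth 1 -mmin +1440 | while read -r stale; do
    echo "stale entry: $stale"
    if [ "${DO_REMOVE:-0}" = 1 ]; then
        rm -rf -- "$stale"
    fi
done
```

An age cutoff sidesteps the concern about blindly "removing _everything_ there": a suite that is still running keeps touching its environment, so only abandoned entries cross the threshold.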
