Good catch!

On Tue, Apr 7, 2015 at 1:02 AM, Dave Brondsema <[email protected]> wrote:
> I started running into this issue again, but believe I have a fix for it now too.
>
> The fix ticket is at https://forge-allura.apache.org/p/allura/tickets/7869/
> Here's how I figured it out:
>
> On a fresh centos VM ./run_tests -p Allura worked fine, but on a "well-used" VM it would lock up at 100% CPU. So there are definitely some environmental aspects that I did not dive into to figure out. But what I did was run the tests in a single process (just nosetests, no ./run_tests helper). I also ran `pip uninstall nose_xunitmp` since that plugin (even when not activated) was spawning a secondary process.
>
> Then I used http://pyrasite.readthedocs.org/en/latest/ to attach to the running python process, because we needed to know what it was doing using up 100% CPU. I used pyrasite's thread dump example to see what each thread was doing in the process:
> http://pyrasite.readthedocs.org/en/latest/Payloads.html#dumping-thread-stacks
>
> Sometimes I didn't see anything helpful, but once I saw a very recursive-looking call going through `lookup_template_engine` in `patches.py` many, many times. I commented that monkeypatch out and ran the tests again, and they completed but with a max recursion error on the `encode` monkeypatched method. Commented that one out and the tests completed with just a few test failures (presumably due to monkeypatches not being applied). Then I saw the comment at the end of `patches.py` describing patches being applied too many times during tests. I confirmed that was happening by restoring the monkeypatch code and logging the stack depth `len(traceback.extract_stack())` from inside a monkeypatched method. It would start around 60 and slowly creep up into the mid-100s. Perhaps it would've gone higher if I let it keep running. With the fix to apply monkeypatches only once, it stays right around 60.
>
> I don't know why this would cause some environments to completely "lock up". Maybe they didn't and it just seemed like it. I would've expected a max recursion error to be thrown. But it does seem to fix it, and it explains why the problems were related to a certain number of tests run in a process and why multiprocessing avoided it by dividing the test suites into multiple pieces.
>
> Surprisingly, I didn't notice any speedups with this change, comparing ./run_tests multiprocessing on master against ./run_tests multiprocessing with the monkeypatch-once logic.
>
> On 9/30/14 7:48 AM, Igor Bondarenko wrote:
> > We're currently using `./run_tests -n 2 -m 4` on single-core machines and that works fine for us.
> >
> > On Mon, Sep 29, 2014 at 5:44 PM, Dave Brondsema <[email protected]> wrote:
> >
> >> Seems like there's no easy fix for this :(
> >>
> >> Since the workaround we're adopting is to run the Allura package's tests with nosetests --processes=2 (or more), we should probably force ./run_tests to do that so it doesn't cause problems for somebody trying out Allura on a single-core machine or VM. Any downside to that?
> >>
> >> On 9/25/14 12:41 AM, Alex Luberg wrote:
> >>> I have discovered that the suite passed with 756 tests, and if I added another test (just copied some existing one with a different name) it locked up at some test (which was not the one I'd copied). I suspect that it is not related to the actual test code, but something with nose/python/sandbox.
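For anyone wanting to reproduce the thread-dump step Dave describes in his Apr 7 message at the top of the quote: the sketch below is a rough, self-contained equivalent of what a "dump thread stacks" payload boils down to once it is running inside the stuck interpreter. It is an illustration, not pyrasite's actual payload file.

    import sys
    import threading
    import traceback


    def dump_thread_stacks():
        """Print the current stack of every thread in this process."""
        # Map thread ids to names so the output is easier to read.
        names = {t.ident: t.name for t in threading.enumerate()}
        for thread_id, frame in sys._current_frames().items():
            print("Thread %s (%s):" % (thread_id, names.get(thread_id, "?")))
            print("".join(traceback.format_stack(frame)))


    if __name__ == "__main__":
        dump_thread_stacks()

Injected into a process spinning at 100% CPU, output like this is what revealed the deeply nested `lookup_template_engine` calls mentioned above.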
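The creeping stack depth Dave measured is the classic symptom of a monkeypatch being installed repeatedly, with each application wrapping the already-patched function. Here is a minimal, self-contained illustration of the problem, the `len(traceback.extract_stack())` logging trick, and the apply-once guard. The names are hypothetical; this is not Allura's actual patches.py.

    import traceback


    class Engine(object):
        """Stand-in for the object whose method gets monkeypatched (illustrative only)."""

        def lookup(self, name):
            return "template:%s" % name


    def apply_patches():
        original = Engine.lookup  # may already be a previously installed wrapper!

        def patched_lookup(self, name):
            # The debugging trick from the thread: log the stack depth on every call.
            print("stack depth: %d" % len(traceback.extract_stack()))
            return original(self, name)

        Engine.lookup = patched_lookup


    _patches_applied = False


    def apply_patches_once():
        """The fix: make patching idempotent so wrappers never pile up."""
        global _patches_applied
        if not _patches_applied:
            apply_patches()
            _patches_applied = True


    if __name__ == "__main__":
        # Simulate a fixture that (incorrectly) re-applies the patch for every test:
        for _ in range(5):
            apply_patches()
        # A single lookup() now threads through 5 nested wrappers, so the logged
        # depth grows by one per wrapper -- the slow creep from ~60 into the 100s.
        Engine().lookup("index")
        # Calling apply_patches_once() from the fixture instead installs the
        # wrapper a single time, and the depth stays flat however many tests run.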
> >>>
> >>> On Mon, Sep 22, 2014 at 3:40 AM, Igor Bondarenko <[email protected]> wrote:
> >>>
> >>>> On Sat, Sep 20, 2014 at 12:25 AM, Dave Brondsema <[email protected]> wrote:
> >>>>
> >>>>> On 9/19/14 12:18 PM, Dave Brondsema wrote:
> >>>>>> Starting with Igor's comments on https://sourceforge.net/p/allura/tickets/7657/#c7d9
> >>>>>>
> >>>>>>> There's a couple of new tests commented out in the last commit. I can't figure out why, but they cause allura/tests/test_dispatch.py to hang when run together with other tests. Also I have added and then removed tests for enable/disable user for the same reason.
> >>>>>>>
> >>>>>>> I think it needs another pair of eyes on it, since I've already spent too much time dealing with these tests and have no idea what's happening... Maybe I'm missing something obvious.
> >>>>>>
> >>>>>> Alex and I have seen this recently too, and it's hard to figure out what exactly the problem is. I first noticed it when running `./run_tests --with-coverage`, which would run nosetests in the Allura dir and would not use --processes=N because of the with-coverage param. So basically just a regular run of the tests in the Allura dir would cause the CPU to go to 100% usage and the tests wouldn't finish. Couldn't ctrl-C or profile them, had to kill -9 it.
> >>>>>>
> >>>>>> That was on Centos 5.10, and a workaround was to run with --processes=N and then the tests would finish fine. On the Ubuntu vagrant image, I didn't encounter any problem in the first place. So perhaps it's related to the environment.
> >>>>>>
> >>>>>> I tried to narrow down to a specific test that might be the culprit. I found tests consistently got up to TestSecurity.test_auth (which is a bit of a weird and old test anyway), and also that commenting out that test let them all pass.
> >>>>>>
> >>>>>> But I'm pretty sure Alex said he dug into this as well and found variation in what tests could cause the problem. I think he told me that going back in git history to before the problem, and then adding a single test (a copy of an existing one), caused the problem. So perhaps some limit, or resource tipping point, is hit.
> >>>>>>
> >>>>>> Alex or Igor, any more data points you know from what you've seen?
> >>>>>>
> >>>>>> Anyone else seen anything like this? Or have ideas for how to approach nailing it down better?
> >>>>>
> >>>>> I tried checking out branch je/42cc_7657 and going back to commit 4cc63586e5728d7d0c5c2f09150eb07eb7e4edc1 (before the tests were commented out) to see what happened for me:
> >>>>>
> >>>>> On vagrant / ubuntu, it froze at test_dispatch.py same as you. So some consistency there. Tests passed when I ran `nosetests --process-timeout=180 --processes=4 -v` in the Allura dir. Seemed slow at the end though, almost thought it froze.
> >>>>>
> >>>>> On centos, it froze at a different spot with a regular nosetests run. It passed with `nosetests allura/tests/ --processes=4 --process-timeout=180 -v`.
> >>>>> For some reason (hopefully unrelated), I needed to specify the path "allura/tests/" to avoid an IOError from multiprocessing.
> >>>>>
> >>>>> So at least multiprocess tests still seem like a workaround for me. Note: ./run_tests picks a --processes=N value dynamically based on the machine's CPU cores, so with a single core you don't get multiple processes that way. Also note: if you have nose-progressive installed and active, that is incompatible with multiple processes.
> >>>>
> >>>> It works exactly as you described for me too.
> >>>>
> >>>> I've reverted some commits with those tests, since the problem is not with them and they are useful: https://sourceforge.net/p/allura/tickets/7657/#8c06. I also made a fix in 42cc's Makefile (committed directly in master) so that it would always run tests in parallel (it turns out here at 42cc we have single-core CPUs on the boxes that run tests, which is why I had lockups on our CI too :( ).
>
> --
> Dave Brondsema : [email protected]
> http://www.brondsema.net : personal
> http://www.splike.com : programming
> <><
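On the ./run_tests point discussed above (deriving --processes=N from the CPU count, and the proposal to force at least two processes even on a single-core box), the idea could look roughly like the sketch below. This is a hypothetical stand-in for illustration, not the actual ./run_tests or Makefile code.

    import multiprocessing
    import subprocess


    def build_nosetests_cmd(test_path="allura/tests/", timeout=180):
        # Derive the process count from the machine's cores, but never drop
        # below 2 so single-core machines still get the multiprocess workaround.
        procs = max(2, multiprocessing.cpu_count())
        return [
            "nosetests", test_path,
            "--processes=%d" % procs,
            "--process-timeout=%d" % timeout,
            "-v",
        ]


    if __name__ == "__main__":
        cmd = build_nosetests_cmd()
        print("running: %s" % " ".join(cmd))
        subprocess.call(cmd)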
