I just realized what it is. When a vstart cluster is stopped, killall is used, and it kills all processes by name! Stopping one cluster therefore also kills the daemons of any other vstarted test, so you can't run vstarted tests in parallel.
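
One way around the by-name kill would be to signal only the pids that a given cluster recorded itself. This is a minimal sketch, not what vstart does today: it assumes each daemon writes its pid to a *.pid file in a per-cluster run directory, and both that layout and the name stop_cluster are hypothetical.

```python
import os
import signal
from pathlib import Path


def stop_cluster(run_dir: str) -> list[int]:
    """Send SIGTERM to every pid recorded in run_dir/*.pid.

    Only processes listed in this cluster's own directory are signalled,
    so daemons with the same name in another cluster are left alone.
    Returns the pids that were actually signalled.
    """
    signalled = []
    for pid_file in sorted(Path(run_dir).glob("*.pid")):
        pid = int(pid_file.read_text().strip())
        try:
            os.kill(pid, signal.SIGTERM)
            signalled.append(pid)
        except ProcessLookupError:
            pass  # the daemon already exited
        pid_file.unlink()  # the pid file is stale either way
    return signalled
```

With two clusters using separate run directories, stop_cluster("cluster-a/out") would leave the ceph-osd processes of the other cluster running, which killall ceph-osd cannot do.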

David Zafman
Senior Developer
http://www.inktank.com

> On Oct 21, 2014, at 7:55 PM, Loic Dachary <[email protected]> wrote:
>
> Hi,
>
> Something strange happens on fedora20 with linux 3.11.10-301.fc20.x86_64.
> Running make -j8 check on https://github.com/ceph/ceph/pull/2750 a process
> gets killed from time to time. For instance it shows as
>
> TEST_erasure_crush_stripe_width: 124: stripe_width=4096
> TEST_erasure_crush_stripe_width: 125: ./ceph osd pool create pool_erasure 12 12 erasure
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> ./test/mon/osd-pool-create.sh: line 120: 27557 Killed ./ceph osd pool create pool_erasure 12 12 erasure
> TEST_erasure_crush_stripe_width: 126: ./ceph --format json osd dump
> TEST_erasure_crush_stripe_width: 126: tee osd-pool-create/osd.json
>
> in the test logs. Note the "27557 Killed". I originally thought it was because
> some ulimit was crossed and set them to very generous / unlimited hard / soft
> thresholds.
>
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 515069
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 400000
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) unlimited
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> Benoit Canet suggested that I install systemtap (
> https://www.sourceware.org/systemtap/wiki/SystemtapOnFedora ) and run
> https://sourceware.org/systemtap/examples/process/sigkill.stp to watch what
> was sending the kill signal. It showed the following:
>
> ...
> SIGKILL was sent to ceph-osd (pid:27557) by vstart_wrapper. uid:1001
> SIGKILL was sent to python (pid:27557) by vstart_wrapper. uid:1001
> ....
>
> which suggests that pid 27557 used by ceph-osd was reused for the python
> script that was killed above. Because the script that kills daemons is very
> aggressive and uses kill -9 on the pid to check if it really is dead
>
> https://github.com/ceph/ceph/blob/giant/src/test/mon/mon-test-helpers.sh#L64
>
> it explains the problem.
>
> However, as Dan Mick suggests, reusing a pid so quickly could break a number
> of things and it is a surprising behavior. Maybe something else is going on.
> A loop creating processes sees their pids increasing and not being reused.
>
> Any idea about what is going on would be much appreciated :-)
>
> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre
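
For the pid reuse scenario described in the quoted message, a guard before the kill -9 could look roughly like the sketch below. It assumes Linux, where /proc/<pid>/comm holds the command name (truncated to 15 characters); the helper names process_name and kill_if_still are hypothetical and not part of mon-test-helpers.sh. The check only narrows the window: the pid can still be recycled between the lookup and the kill.

```python
import os
import signal
from typing import Callable, Optional


def process_name(pid: int) -> Optional[str]:
    """Return the command name of pid from /proc, or None if it is gone."""
    try:
        with open(f"/proc/{pid}/comm") as f:
            return f.read().strip()
    except (FileNotFoundError, ProcessLookupError):
        return None


def kill_if_still(pid: int, expected_name: str,
                  lookup: Callable[[int], Optional[str]] = process_name) -> bool:
    """SIGKILL pid only if it still runs the expected command.

    Returns True if the signal was sent, False if the pid has exited or
    now belongs to a different command (for example a recycled pid).
    """
    if lookup(pid) != expected_name:
        return False
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return False  # exited between the lookup and the kill
    return True
```

In the log above, kill_if_still(27557, "ceph-osd") would have returned False for the second SIGKILL, because by then the pid belonged to python.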
