Hi David,

On 22/10/2014 15:21, David Zafman wrote:
>
> I just realized what it is.  The way killall is used when stopping a vstart
> cluster is to kill all processes by name!  You can't stop vstarted tests
> running in parallel.

Indeed, I discovered this too. So instead of using ./stop.sh I use

https://github.com/dachary/ceph/blob/6e6ddfbdc0a178a6318a86fd9984265bbe40ca3d/src/test/mon/mon-test-helpers.sh#L62

in the context of 

https://github.com/dachary/ceph/blob/6e6ddfbdc0a178a6318a86fd9984265bbe40ca3d/src/test/vstart_wrapper.sh#L28

which makes it kill only the processes that have a pid file in the relevant
directory. The problem below showed up because the helper was doing an
aggressive kill -9 to check whether the process still existed.

https://github.com/dachary/ceph/commit/6e6ddfbdc0a178a6318a86fd9984265bbe40ca3d

Now that the kill -9 existence check is replaced with kill -0, all is well.
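
In rough terms, the helper now does something like this (just a sketch with
illustrative names such as $dir, not the actual code; see the links above
for the real thing):

  for pidfile in "$dir"/*.pid ; do
      pid=$(cat "$pidfile")
      kill "$pid"
      # kill -0 only probes whether the pid exists, it sends no signal,
      # so a recycled pid can no longer be nuked by mistake
      while kill -0 "$pid" 2>/dev/null ; do sleep 1 ; done
  done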

For the record, the problem can be reliably reproduced by running make -j8
check from
https://github.com/dachary/ceph/commit/c02bb8a5afef8669005c78b2b4f2f762cda4ee73
and waiting between roughly 30 minutes and one hour on a machine with 24
cores, 64GB RAM and a 250GB SSD.
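
Roughly, assuming the autotools build in use at the time:

  git clone https://github.com/dachary/ceph
  cd ceph
  git checkout c02bb8a5afef8669005c78b2b4f2f762cda4ee73
  ./autogen.sh && ./configure
  make -j8 check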

Cheers


> 
> David Zafman
> Senior Developer
> http://www.inktank.com
> 
> 
> 
> 
>> On Oct 21, 2014, at 7:55 PM, Loic Dachary <[email protected]> wrote:
>>
>> Hi,
>>
>> Something strange happens on fedora20 with linux 3.11.10-301.fc20.x86_64.
>> Running make -j8 check on https://github.com/ceph/ceph/pull/2750, a process
>> gets killed from time to time. For instance, it shows up as
>>
>> TEST_erasure_crush_stripe_width: 124: stripe_width=4096
>> TEST_erasure_crush_stripe_width: 125: ./ceph osd pool create pool_erasure 12 12 erasure
>> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
>> ./test/mon/osd-pool-create.sh: line 120: 27557 Killed                  ./ceph osd pool create pool_erasure 12 12 erasure
>> TEST_erasure_crush_stripe_width: 126: ./ceph --format json osd dump
>> TEST_erasure_crush_stripe_width: 126: tee osd-pool-create/osd.json
>>
>> in the test logs. Note the "27557 Killed". I originally thought some ulimit
>> was being crossed, so I set them all to very generous / unlimited hard and
>> soft thresholds:
>>
>> core file size          (blocks, -c) 0
>> data seg size           (kbytes, -d) unlimited
>> scheduling priority             (-e) 0
>> file size               (blocks, -f) unlimited
>> pending signals                 (-i) 515069
>> max locked memory       (kbytes, -l) unlimited
>> max memory size         (kbytes, -m) unlimited
>> open files                      (-n) 400000
>> pipe size            (512 bytes, -p) 8
>> POSIX message queues     (bytes, -q) 819200
>> real-time priority              (-r) 0
>> stack size              (kbytes, -s) unlimited
>> cpu time               (seconds, -t) unlimited
>> max user processes              (-u) unlimited
>> virtual memory          (kbytes, -v) unlimited
>> file locks                      (-x) unlimited
>>
>> Benoit Canet suggested that I install systemtap (
>> https://www.sourceware.org/systemtap/wiki/SystemtapOnFedora ) and run
>> https://sourceware.org/systemtap/examples/process/sigkill.stp to watch what
>> was sending the kill signal. It showed the following:
>>
>> ...
>> SIGKILL was sent to ceph-osd (pid:27557) by vstart_wrapper. uid:1001
>> SIGKILL was sent to python (pid:27557) by vstart_wrapper. uid:1001
>> ....
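>>
>> For reference, the script was run along these lines (a sketch; it assumes
>> systemtap and the matching kernel debuginfo are installed, and fetching
>> with wget is just one way to get the example):
>>
>>   wget https://sourceware.org/systemtap/examples/process/sigkill.stp
>>   sudo stap sigkill.stp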
>>
>> This output suggests that the pid 27557 used by ceph-osd was reused for the
>> python script that was killed above. Because the script that kills daemons
>> is very aggressive and uses kill -9 on the pid to check whether it really
>> is dead
>>
>> https://github.com/ceph/ceph/blob/giant/src/test/mon/mon-test-helpers.sh#L64
>>
>> it explains the problem.
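>>
>> The racy pattern, in a rough sketch (illustrative names, not the actual
>> helper):
>>
>>   pid=$(cat "$dir/osd.0.pid")
>>   kill -9 "$pid"
>>   # repeat kill -9 until it fails, as an existence check; if the pid has
>>   # been recycled in the meantime, the signal lands on an innocent process
>>   while kill -9 "$pid" 2>/dev/null ; do sleep 1 ; done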
>>
>> However, as Dan Mick suggests, reusing pids that quickly could break a
>> number of things and is surprising behavior. Maybe something else is going
>> on: a loop creating processes sees their pids increasing, not being reused.
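>>
>> For instance, a quick loop along these lines (nothing scientific) shows
>> strictly increasing pids:
>>
>>   for i in $(seq 20) ; do sh -c 'echo $$' ; done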
>>
>> Any idea about what is going on would be much appreciated :-)
>>
>> Cheers
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
