Thanks Christine, and sorry for the late reply. I hit this problem again, and corosync-blackbox just hangs there with no output. Here is some more debug information for you.
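(For anyone who wants to look at the flight data directly: the saved recorder file can also be decoded with libqb's qb-blackbox tool. Assuming the default data directory, which may differ on other distributions, that would be something like

  qb-blackbox /var/lib/corosync/fdata

but in my case corosync-blackbox hangs before producing anything, so I don't have a dump to share yet.)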
The backtrace and perf.data look very similar to the ones in [1], but we don't know the root cause. Restarting corosync is one workaround, but after a while it breaks again, so we'd like to find out what is really going on. Thanks for your efforts, much appreciated :)

[1] http://www.spinics.net/lists/corosync/msg03445.html

On Mon, Feb 9, 2015 at 4:38 PM, Christine Caulfield <[email protected]> wrote:
> On 09/02/15 01:59, Hui Xiang wrote:
> > Hi guys,
> >
> > I am having an issue with corosync where it consumes 100% CPU and hangs
> > on the command corosync-quorumtool -l; Recv-Q is very high in the
> > meantime inside the LXC container.
> >
> > corosync version: 2.3.3
> > transport: unicast
> >
> > After setting up 3 keystone nodes with corosync/pacemaker, split brain
> > happened; on one of the keystone nodes we found that the CPU is 100%
> > used by corosync.
>
> It looks like it might be a problem I saw while doing some development
> on corosync: if it gets a SEGV, there's a signal handler that catches it
> and relays it back to libqb via a pipe, causing another SEGV, and
> corosync is then just spinning on the pipe for ever. The cause I saw is
> not likely to be the same as yours (it was my coding at the time ;-) but
> it does sound like a similar effect. The only way round it is to kill
> corosync and restart it. There might be something in the
> corosync-blackbox to indicate what went wrong, if that has been saved.
> If you have that then please post it here so we can have a look.
>
> man corosync-blackbox
>
> Chrissie
>
> > Tasks: 42 total, 2 running, 40 sleeping, 0 stopped, 0 zombie
> > %Cpu(s): 100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> > KiB Mem:  1017896 total, 932296 used, 85600 free, 19148 buffers
> > KiB Swap: 1770492 total, 5572 used, 1764920 free. 409312 cached Mem
> >
> >   PID USER PR NI   VIRT    RES   SHR S %CPU %MEM    TIME+ COMMAND
> > 18637 root 20  0 704252 199272 34016 R 99.9 19.6 44:40.43 corosync
> >
> > From the netstat output, one interesting finding is that the Recv-Q
> > size is 320256, which is much higher than normal. After simply doing
> > pkill -9 corosync and restarting corosync/pacemaker, the whole cluster
> > is back to normal.
> >
> > Active Internet connections (only servers)
> > Proto Recv-Q Send-Q Local Address       Foreign Address State PID/Program name
> > udp   320256      0 192.168.100.67:5434 0.0.0.0:*             18637/corosync
> >
> > Udp:
> >     539832 packets received
> >     619 packets to unknown port received.
> >     407249 packet receive errors
> >     1007262 packets sent
> >     RcvbufErrors: 69940
> >
> > So I am asking whether there is any known bug/issue in corosync that
> > could cause it to drain packets from the socket slowly and then hang?
> >
> > Thanks a lot, looking forward to your response.
> >
> > Best regards,
> > Hui.
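Just to check my understanding of the failure mode Chrissie describes, below is a rough, hypothetical sketch (not corosync or libqb source, only an illustration of the general pattern): a SIGSEGV handler pushes the signal number into a pipe so the main loop can log, dump the blackbox and shut down outside signal context; if the code on the other end of that pipe faults as well, the process never gets past the handler/pipe wakeup and just burns CPU, which sounds like what we are seeing.

#define _POSIX_C_SOURCE 200809L
/* Hypothetical illustration only -- NOT corosync/libqb code.
 * Pattern: a fatal-signal handler relays the signal to the main
 * loop through a pipe ("self-pipe trick") so cleanup can happen
 * outside signal context. */
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int relay_pipe[2];

static void fatal_handler(int sig)
{
    /* Async-signal-safe: just push the signal number into the pipe.
     * Hazard: if the code that later drains the pipe faults again
     * (e.g. the original SEGV corrupted state it depends on), control
     * can keep bouncing between the fault and this handler without
     * ever making progress -- the 100%-CPU, never-exiting behaviour
     * described in this thread. */
    unsigned char b = (unsigned char)sig;
    (void)write(relay_pipe[1], &b, 1);
}

int main(void)
{
    if (pipe(relay_pipe) != 0) {
        perror("pipe");
        return 1;
    }

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = fatal_handler;
    sigaction(SIGSEGV, &sa, NULL);

    /* Simulate the crash; in real life this is a genuine fault. */
    raise(SIGSEGV);

    /* Main-loop side: wake up on the pipe and react to the signal.
     * A real daemon would dump its flight data and exit here; failing
     * to exit is what leaves the process alive but wedged. */
    struct pollfd pfd = { .fd = relay_pipe[0], .events = POLLIN };
    if (poll(&pfd, 1, 1000) == 1) {
        unsigned char b;
        if (read(relay_pipe[0], &b, 1) == 1)
            fprintf(stderr, "fatal signal %d relayed via pipe\n", (int)b);
    }
    return 0;
}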
_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss
