On 21/04/15 12:37, Hui Xiang wrote:
> Thanks Christine.
>
> One more question: in the broken environment, we found the following code in
> the libqb source we are running:
> 1)
> void *
> qb_rb_chunk_alloc(struct qb_ringbuffer_s * rb, size_t len)
> {
>     uint32_t write_pt;
>
>     if (rb == NULL) {
>         errno = EINVAL;
>         return NULL;
>     }
>     /*
>      * Reclaim data if we are over writing and we need space
>      */
>     if (rb->flags & QB_RB_FLAG_OVERWRITE) {
>         while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
>             _rb_chunk_reclaim(rb);
>         }
>     } else {
>         if (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
>             errno = EAGAIN;
>             return NULL;
>         }
>     }
>
> but in the master branch:
> 2)
> while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
>     int rc = _rb_chunk_reclaim(rb);
>     if (rc != 0) {
>         errno = rc;
>         return NULL;
>     }
> }
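>
> For context, here is a hedged sketch of how a writer normally exercises this
> path (function names as I remember them from libqb's qbrb.h; please verify
> the exact signatures against the installed headers). The overwrite flag is
> what sends qb_rb_chunk_alloc() into the reclaim loop shown in 1) once the
> buffer is full:
>
> /* Hedged usage sketch -- check signatures against your libqb qbrb.h. */
> #include <qb/qbrb.h>
> #include <string.h>
>
> int write_record(const char *msg)
> {
>     /* An overwriting ring buffer: when it fills up, old chunks are
>      * reclaimed, which is the while () loop in snippet 1) above. */
>     qb_ringbuffer_t *rb = qb_rb_open("demo-rb", 64 * 1024,
>                                      QB_RB_FLAG_CREATE | QB_RB_FLAG_OVERWRITE, 0);
>     if (rb == NULL) {
>         return -1;
>     }
>
>     size_t len = strlen(msg) + 1;
>     void *chunk = qb_rb_chunk_alloc(rb, len);  /* may spin in the old code */
>     if (chunk == NULL) {
>         qb_rb_close(rb);
>         return -1;
>     }
>     memcpy(chunk, msg, len);
>     qb_rb_chunk_commit(rb, len);
>
>     qb_rb_close(rb);
>     return 0;
> }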
>
>
> is it possible that with code 1) we have been stuck in an infinite loop of
> while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {...}? Under the
> condition 'chunk_magic != QB_RB_CHUNK_MAGIC', the function
> _rb_chunk_reclaim() just returns:
> static void
> _rb_chunk_reclaim(struct qb_ringbuffer_s * rb)
> {
>     uint32_t old_read_pt;
>     uint32_t new_read_pt;
>     uint32_t old_chunk_size;
>     uint32_t chunk_magic;
>
>     old_read_pt = rb->shared_hdr->read_pt;
>     chunk_magic = QB_RB_CHUNK_MAGIC_GET(rb, old_read_pt);
>     if (chunk_magic != QB_RB_CHUNK_MAGIC) {
>         return;
>     }
>     ...
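>
> To make the suspicion concrete, here is a minimal, self-contained model of
> that scenario (simplified stand-in names and numbers such as struct fake_rb
> and reclaim_old, not the real libqb structures): if the reclaim step silently
> returns when the chunk magic is corrupt, the amount of free space never
> changes, so the overwrite-path loop can never make progress.
>
> /* Hypothetical, simplified model of the suspected spin -- not libqb code. */
> #include <stdint.h>
> #include <stdio.h>
>
> #define CHUNK_MAGIC  0xA1A1A1A1u
> #define CHUNK_MARGIN 8
>
> struct fake_rb {
>     uint32_t free_bytes;   /* stands in for qb_rb_space_free() */
>     uint32_t chunk_magic;  /* stands in for QB_RB_CHUNK_MAGIC_GET() */
> };
>
> /* Old behaviour: bail out silently on a corrupt chunk header. */
> static void reclaim_old(struct fake_rb *rb)
> {
>     if (rb->chunk_magic != CHUNK_MAGIC) {
>         return;            /* nothing reclaimed, free_bytes unchanged */
>     }
>     rb->free_bytes += 64;  /* pretend a chunk was freed */
> }
>
> int main(void)
> {
>     struct fake_rb rb = { 16, 0xDEADBEEF };  /* too little space, bad magic */
>     uint32_t len = 128;
>     unsigned long spins = 0;
>
>     /* Mirrors the overwrite path of qb_rb_chunk_alloc(): with the old
>      * reclaim, free_bytes never grows, so this would loop forever.
>      * Capped here so the demo terminates. */
>     while (rb.free_bytes < len + CHUNK_MARGIN) {
>         reclaim_old(&rb);
>         if (++spins == 1000000UL) {
>             printf("still spinning after %lu iterations\n", spins);
>             break;
>         }
>     }
>     return 0;
> }
>
> The master-branch version 2) makes the same failure visible to the caller:
> reclaim returns an error code, and qb_rb_chunk_alloc() sets errno and returns
> NULL instead of spinning.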
> and there is a commit that seems to fix it [1]. Do you know the
> background of this commit? Does it look like it fixes this?
>
> Thanks again :)
I don't know enough about the background to that fix. What you're saying
sounds plausible but I can't be sure. There are quite a few stability
fixes in libqb 0.17 so it could be that or one of the others!
Chrissie
> [1]
> https://github.com/ClusterLabs/libqb/commit/a8852fc481e3aa3fce53bb9e3db79d3e7cbed0c1
>
>
>
> On Tue, Apr 21, 2015 at 5:55 PM, Christine Caulfield
> <[email protected]> wrote:
>
> Hiya,
>
> It's hard to be sure without more information, sadly - if the backtrace
> looks similar to the one you mention then upgrading libqb to 0.17 should
> help.
>
> Chrissie
>
> On 21/04/15 07:12, Hui Xiang wrote:
> > Thanks Christine, sorry for responding late.
> >
> > I got this problem again, and corosync-blackbox just hangs there with no
> > output. Here is some other debug information for you guys.
> >
> > The backtrace and perf.data are very similar to the ones in link [1], but
> > we don't know the root cause. Restarting corosync is one solution, but
> > after a while it breaks again, so we'd like to find out what's really
> > going on there.
> >
> > Thanks for your efforts, much appreciated :)
> >
> > [1] http://www.spinics.net/lists/corosync/msg03445.html
> >
> >
> > On Mon, Feb 9, 2015 at 4:38 PM, Christine Caulfield
> > <[email protected]> wrote:
> >
> > On 09/02/15 01:59, Hui Xiang wrote:
> > > Hi guys,
> > >
> > > I am having an issue with corosync where it consumes 100% CPU and hangs
> > > on the command corosync-quorumtool -l; Recv-Q is very high in the
> > > meantime, inside an lxc container.
> > > corosync version : 2.3.3
> > >
> > > transport : unicast
> > >
> > > After setting up 3 keystone nodes with corosync/pacemaker, a split brain
> > > happened; on one of the keystone nodes we found the CPU is 100% used by
> > > corosync.
> > >
> >
> >
> > It looks like it might be a problem I saw while doing some development
> > on corosync: if it gets a SEGV, there's a signal handler that catches it
> > and relays it back to libqb via a pipe, causing another SEGV, and
> > corosync then just spins on the pipe for ever. The cause I saw is not
> > likely to be the same as yours (it was my coding at the time ;-) but
> > it does sound like a similar effect. The only way round it is to kill
> > corosync and restart it. There might be something in the
> > corosync-blackbox to indicate what went wrong, if that has been saved.
> > If you have that then please post it here so we can have a look.
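> >
> > For readers not familiar with that mechanism, here is a hedged, minimal
> > sketch of the general "relay a fatal signal through a pipe to the main
> > loop" pattern being described (illustrative only, not libqb's actual
> > implementation). The hazard is exactly as above: if acting on the relayed
> > event faults again, the signal keeps being relayed and the process ends up
> > looping on the pipe at 100% CPU.
> >
> > /* Minimal self-pipe signal relay sketch -- not libqb's actual code. */
> > #include <signal.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <unistd.h>
> >
> > static int sig_pipe[2];
> >
> > /* Async-signal-safe handler: just push the signal number down the pipe. */
> > static void relay_handler(int signum)
> > {
> >     char c = (char)signum;
> >     (void)write(sig_pipe[1], &c, 1);   /* write() is async-signal-safe */
> > }
> >
> > int main(void)
> > {
> >     if (pipe(sig_pipe) != 0) {
> >         perror("pipe");
> >         return 1;
> >     }
> >
> >     struct sigaction sa;
> >     memset(&sa, 0, sizeof(sa));
> >     sa.sa_handler = relay_handler;
> >     sigaction(SIGSEGV, &sa, NULL);
> >
> >     /* Main loop: read relayed signals from the pipe and act on them.
> >      * If the "act on them" step faults again (the scenario described
> >      * above), SIGSEGV is relayed once more and the loop never escapes. */
> >     for (;;) {
> >         char c;
> >         if (read(sig_pipe[0], &c, 1) == 1) {
> >             fprintf(stderr, "relayed signal %d\n", (int)c);
> >             exit(1);   /* real code would dump a blackbox first */
> >         }
> >     }
> > }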
> >
> > man corosync-blackbox
> >
> > Chrissie
> >
> > >
> > > Tasks: 42 total, 2 running, 40 sleeping, 0 stopped, 0 zombie
> > > %Cpu(s): 100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> > > KiB Mem: 1017896 total, 932296 used, 85600 free, 19148 buffers
> > > KiB Swap: 1770492 total, 5572 used, 1764920 free. 409312 cached Mem
> > >
> > > PID   USER PR NI VIRT   RES    SHR   S %CPU %MEM TIME+    COMMAND
> > > 18637 root 20 0  704252 199272 34016 R 99.9 19.6 44:40.43 corosync
> > >
> > > From the netstat output, one interesting finding is that the Recv-Q size
> > > has a value of 320256, which is higher than normal.
> > > And after simply doing pkill -9 corosync and restarting corosync/pacemaker,
> > > the whole cluster is back to normal.
> > >
> > > Active Internet connections (only servers)
> > > Proto Recv-Q Send-Q Local Address       Foreign Address State PID/Program name
> > > udp   320256 0      192.168.100.67:5434 0.0.0.0:*             18637/corosync
> > >
> > > Udp:
> > > 539832 packets received
> > > 619 packets to unknown port received.
> > > 407249 packet receive errors
> > > 1007262 packets sent
> > > RcvbufErrors: 69940
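> > >
> > > As a hedged illustration of what those Recv-Q and RcvbufErrors numbers
> > > mean (the process owning the socket has stopped draining it), here is a
> > > small standalone demo, not corosync code: a UDP socket that is never
> > > read from accumulates datagrams in its kernel receive queue until the
> > > receive buffer is exhausted, after which further packets are dropped and
> > > counted as receive errors / RcvbufErrors.
> > >
> > > /* Standalone demo (not corosync code): an undrained UDP socket shows a
> > >  * growing Recv-Q in netstat/ss, then RcvbufErrors in /proc/net/snmp. */
> > > #include <arpa/inet.h>
> > > #include <netinet/in.h>
> > > #include <string.h>
> > > #include <sys/socket.h>
> > > #include <unistd.h>
> > >
> > > int main(void)
> > > {
> > >     int rx = socket(AF_INET, SOCK_DGRAM, 0);
> > >     int tx = socket(AF_INET, SOCK_DGRAM, 0);
> > >
> > >     struct sockaddr_in addr;
> > >     memset(&addr, 0, sizeof(addr));
> > >     addr.sin_family = AF_INET;
> > >     addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
> > >     addr.sin_port = htons(5434);   /* same port as in the netstat output */
> > >     bind(rx, (struct sockaddr *)&addr, sizeof(addr));
> > >
> > >     /* Send datagrams but never call recvfrom() on rx: this is roughly
> > >      * what a spinning corosync looks like from the kernel's side. */
> > >     char payload[1024] = { 0 };
> > >     for (int i = 0; i < 100000; i++) {
> > >         sendto(tx, payload, sizeof(payload), 0,
> > >                (struct sockaddr *)&addr, sizeof(addr));
> > >     }
> > >
> > >     /* While this sleeps, `netstat -unap` (or `ss -unap`) shows a large
> > >      * Recv-Q for 127.0.0.1:5434, and the Udp counters in /proc/net/snmp
> > >      * show RcvbufErrors growing once the buffer filled up. */
> > >     sleep(60);
> > >
> > >     close(rx);
> > >     close(tx);
> > >     return 0;
> > > }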
> > >
> > >
> > > So I am asking whether there is any known bug/issue in corosync that may
> > > cause it to receive packets from the socket slowly and hang for some
> > > reason?
> > >
> > > Thanks a lot, looking forward to your response.
> > >
> > >
> > > Best Regards.
> > >
> > > Hui.
> > >
> > >
> > >
> > >
> >
> >
> >
>
>
_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss