Hi list,

I ran into an issue while testing corosync (v1.4.5) with 11 nodes. I am not very
familiar with corosync, so please correct me if I am wrong. The steps to
reproduce are as follows:

1. Make sure corosync debug is off.
2. Start openais on every node; all of them come up fine.
3. Stop openais on 5 nodes. This takes a very long time, and the retransmit
   list starts growing.

Here is a log excerpt from one node, obtained via corosync-blackbox:

rec=[79224] Tracing(1) Messsage=Received ringid(192.168.100.1:6564) seq 1fd
rec=[79225] Tracing(1) Messsage=Delivering 1fc to 1fd
rec=[79226] Tracing(1) Messsage=Delivering MCAST message with seq 1fd to pending delivery queue
rec=[79227] Tracing(1) Messsage=releasing messages up to and including 1fb
rec=[79228] Tracing(1) Messsage=releasing messages up to and including 1fd
rec=[79229] Log Message=got quorate request on 0x6d0980
rec=[79230] Log Message=got quorate request on 0x6d0980
rec=[79231] Log Message=Retransmit List 1
rec=[79232] Log Message=Retransmit List: 201 
rec=[79233] Tracing(1) Messsage=mcasted message added to pending queue 
rec=[79234] Log Message=Retransmit List 1
rec=[79235] Log Message=Retransmit List: 201 
rec=[79236] Tracing(1) Messsage=Delivering 1fd to 205
rec=[79237] Tracing(1) Messsage=Received ringid(192.168.100.1:6564) seq 205
rec=[79238] Tracing(1) Messsage=Delivering 1fd to 205
rec=[79239] Log Message=Retransmit List 1
rec=[79240] Log Message=Retransmit List: 201 
rec=[79241] Tracing(1) Messsage=Delivering 1fd to 205
rec=[79242] Log Message=Retransmit List 2
rec=[79243] Log Message=Retransmit List: 201 202 
rec=[79244] Tracing(1) Messsage=Delivering 1fd to 205
rec=[79245] Log Message=Retransmit List 2
rec=[79246] Log Message=Retransmit List: 201 202

There is a piece of code in exec/totemsrp.c:

        if (range) {
                TRACE1 ("Delivering %x to %x\n", instance->my_high_delivered,
                        end_point);
        }

...

        for (i = 1; i <= range; i++) {

                void *ptr = 0;

                /*
                 * If out of range of sort queue, stop assembly
                 */
                res = sq_in_range (&instance->regular_sort_queue,
                        my_high_delivered_stored + i);
                if (res == 0) {
                        break;
                }

                res = sq_item_get (&instance->regular_sort_queue,
                        my_high_delivered_stored + i, &ptr);
                /*
                 * If hole, stop assembly
                 */
                if (res != 0 && skip == 0) {
                        break;
                }

                instance->my_high_delivered = my_high_delivered_stored + i;

...

                /*
                 * Message found
                 */
                TRACE1 ("Delivering MCAST message with seq %x to pending delivery queue\n",
                        mcast_header.seq);

From this log and code, we can see that messages 1fe, 1ff and 200 have not
been delivered, so the loop must be exiting through one of the two break
statements.

The first if only checks the sequence ID range, so the second one is the more
likely suspect.
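
To make the reasoning about the two break statements concrete, here is a small
standalone sketch of the same loop shape. It is only a toy model, not the
corosync code: toy_sq, toy_in_range, toy_item_get, toy_deliver and
TOY_QUEUE_SIZE are names made up for illustration, and sequence number
wrap-around is ignored. Which sequence number is the hole is an input to the
sketch, not something it determines.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_SIZE 16  /* hypothetical size, chosen only for the example */

/* Toy stand-in for the sort queue: one in-use flag per slot. */
struct toy_sq {
        unsigned int head_seq;                 /* lowest sequence number in the window */
        unsigned char in_use[TOY_QUEUE_SIZE];  /* 1 if the slot holds a message */
};

/* Returns 1 if seq lies inside the window covered by the queue, else 0. */
static int toy_in_range(const struct toy_sq *sq, unsigned int seq)
{
        return seq >= sq->head_seq && seq < sq->head_seq + TOY_QUEUE_SIZE;
}

/* Returns 0 if a message is stored for seq, nonzero if there is a hole. */
static int toy_item_get(const struct toy_sq *sq, unsigned int seq)
{
        assert(sq != NULL);
        return sq->in_use[seq % TOY_QUEUE_SIZE] ? 0 : 1;
}

/*
 * Same shape as the quoted loop: walk from high_delivered + 1 up to
 * end_point and stop at the first out-of-range sequence number or at the
 * first hole (unless skip is set). Returns the new high-delivered value.
 */
static unsigned int toy_deliver(const struct toy_sq *sq,
        unsigned int high_delivered, unsigned int end_point, int skip)
{
        unsigned int stored = high_delivered;
        unsigned int range = end_point - stored;
        unsigned int i;

        for (i = 1; i <= range; i++) {
                if (!toy_in_range(sq, stored + i)) {
                        break;          /* first break: out of range */
                }
                if (toy_item_get(sq, stored + i) != 0 && skip == 0) {
                        break;          /* second break: hole */
                }
                high_delivered = stored + i;
        }
        return high_delivered;
}
```

In this model, calling toy_deliver with high_delivered = 0x1fd and
end_point = 0x205 while the slot for 0x1fe is not in use leaves the result at
0x1fd on every call, which is the same pattern as the repeated
"Delivering 1fd to 205" lines in the log.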

include/corosync/sq.h:

static inline unsigned int sq_item_get (
        const struct sq *sq,
        unsigned int seq_id,
        void **sq_item_out)

...

        if (sq->items_inuse[sq_position] == 0) {
                return (ENOENT);
        }
    
I think the items_inuse array may be getting cleared at some point, so the
entry reads 0 when we access it and sq_item_get returns ENOENT. However, I
could not dig any deeper than this, so could anyone give me some hints?
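
To illustrate what I mean, here is a toy model of that hypothesis. It is an
assumption on my side, not the actual sq.h code: demo_sq, demo_item_add,
demo_item_get, demo_release_upto and DEMO_SIZE are made-up names, and the slot
position is simply seq_id modulo the queue size. It only shows that once the
in-use flag for a slot has been cleared, a later lookup for that sequence
number returns ENOENT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_SIZE 8  /* hypothetical queue size */

struct demo_sq {
        unsigned int head_seq_id;              /* lowest sequence number still held */
        unsigned char items_inuse[DEMO_SIZE];  /* 1 if the slot holds an item */
        int items[DEMO_SIZE];                  /* payload, an int for simplicity */
};

/* Store value for seq_id and mark its slot as in use. */
static void demo_item_add(struct demo_sq *sq, unsigned int seq_id, int value)
{
        unsigned int pos = seq_id % DEMO_SIZE;

        sq->items[pos] = value;
        sq->items_inuse[pos] = 1;
}

/* Returns 0 and fills *out if the slot is in use, ENOENT otherwise. */
static unsigned int demo_item_get(const struct demo_sq *sq,
        unsigned int seq_id, int *out)
{
        unsigned int pos = seq_id % DEMO_SIZE;

        assert(out != NULL);
        if (sq->items_inuse[pos] == 0) {
                return ENOENT;
        }
        *out = sq->items[pos];
        return 0;
}

/*
 * Hypothetical release step: clear the in-use flags for every sequence
 * number from head_seq_id up to and including seq_id, then advance the head.
 */
static void demo_release_upto(struct demo_sq *sq, unsigned int seq_id)
{
        unsigned int s;

        for (s = sq->head_seq_id; s <= seq_id; s++) {
                sq->items_inuse[s % DEMO_SIZE] = 0;
        }
        sq->head_seq_id = seq_id + 1;
}
```

If something like demo_release_upto ran for a sequence number that had not
been delivered yet, the delivery loop would see a hole there on every pass.
Whether the real release path can do that is exactly what I cannot tell.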

-- 
Best regards,
Guangliang