On Tue, Mar 19, 2013 at 07:44:21AM -0700, Steven Dake wrote:
> On 03/19/2013 03:18 AM, Guangliang Zhao wrote:
> >Hi list,
Hi Steven,
Thanks for your reply.
> >
> >There is an issue when I test corosync (v1.4.5) with 11 nodes. I am not very
> >familiar with corosync, so please correct me if I am wrong. The steps are
> >as follows:
> >
> >1. Make sure corosync debug is off.
> >2. Start openais on every node; all of them come up fine.
> >3. Stop openais on 5 nodes. This takes a very long time, and the retransmit
> >list starts growing.
> >
> >I got a piece of log from one node via corosync-blackbox:
> >
> >rec=[79224] Tracing(1) Messsage=Received ringid(192.168.100.1:6564) seq 1fd
> >rec=[79225] Tracing(1) Messsage=Delivering 1fc to 1fd
> >rec=[79226] Tracing(1) Messsage=Delivering MCAST message with seq 1fd to pending delivery queue
> >rec=[79227] Tracing(1) Messsage=releasing messages up to and including 1fb
> >rec=[79228] Tracing(1) Messsage=releasing messages up to and including 1fd
> >rec=[79229] Log Message=got quorate request on 0x6d0980
> >rec=[79230] Log Message=got quorate request on 0x6d0980
> >rec=[79231] Log Message=Retransmit List 1
> >rec=[79232] Log Message=Retransmit List: 201
> >rec=[79233] Tracing(1) Messsage=mcasted message added to pending queue
> >rec=[79234] Log Message=Retransmit List 1
> >rec=[79235] Log Message=Retransmit List: 201
> >rec=[79236] Tracing(1) Messsage=Delivering 1fd to 205
> >rec=[79237] Tracing(1) Messsage=Received ringid(192.168.100.1:6564) seq 205
> >rec=[79238] Tracing(1) Messsage=Delivering 1fd to 205
> >rec=[79239] Log Message=Retransmit List 1
> >rec=[79240] Log Message=Retransmit List: 201
> >rec=[79241] Tracing(1) Messsage=Delivering 1fd to 205
> >rec=[79242] Log Message=Retransmit List 2
> >rec=[79243] Log Message=Retransmit List: 201 202
> >rec=[79244] Tracing(1) Messsage=Delivering 1fd to 205
> >rec=[79245] Log Message=Retransmit List 2
> >rec=[79246] Log Message=Retransmit List: 201 202
> >
> >There is a piece of code in exec/totemsrp.c:
> >
> >3775 if (range) {
> >3776 TRACE1 ("Delivering %x to %x\n", instance->my_high_delivered,
> >3777 end_point);
> >3778 }
> >
> >...
> >
> >3785 for (i = 1; i <= range; i++) {
> >3786
> >3787 void *ptr = 0;
> >3788
> >3789 /*
> >3790 * If out of range of sort queue, stop assembly
> >3791 */
> >3792 res = sq_in_range (&instance->regular_sort_queue,
> >3793 my_high_delivered_stored + i);
> >3794 if (res == 0) {
> >3795 break;
> >3796 }
> >3797
> >3798 res = sq_item_get (&instance->regular_sort_queue,
> >3799 my_high_delivered_stored + i, &ptr);
> >3800 /*
> >3801 * If hole, stop assembly
> >3802 */
> >3803 if (res != 0 && skip == 0) {
> >3804 break;
> >3805 }
> >3806
> >3807 instance->my_high_delivered = my_high_delivered_stored + i;
> >
> >...
> >
> >3841 /*
> >3842 * Message found
> >3843 */
> >3844 TRACE1 ("Delivering MCAST message with seq %x to pending delivery queue\n",
> >3845 mcast_header.seq);
> >
> >From these logs and the code, we can see that messages 1fe, 1ff and 200
> >have not been delivered, so the loop must have exited through one of the
> >two break statements.
> >
> >The first if only checks the seq id range, so the second one is the more
> >likely suspect.
> >
> >include/corosync/sq.h:
> >
> >264 static inline unsigned int sq_item_get (
> >265 const struct sq *sq,
> >266 unsigned int seq_id,
> >267 void **sq_item_out)
> >
> >...
> >
> >286 if (sq->items_inuse[sq_position] == 0) {
> >287 return (ENOENT);
> >288 }
> >
> >I think the items_inuse array may sometimes be cleared, so that it contains
> >0 when we access it. However, I could not dig any deeper, so could anyone
> >give me some hints?
> >
>
> items_inuse[sq_position] should contain zero if there is no entry.
> If there is no entry, we want to stop processing in the above code
> because it is a hole in the messages.
If we want to skip the hole in the messages, I think my_high_delivered (or
some other state) should be updated, but it is not. So it always tries to
deliver messages starting from my_high_delivered + 1 and never succeeds,
because message my_high_delivered + 1 is the hole?

I collected the corosync-blackbox output from one of the nodes, but the log is
pretty big. I can send it as an attachment in my next mail if you need it.
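
To make the question concrete, here is a minimal toy model of the loop quoted
above. It is only a sketch and not the corosync code: toy_queue, toy_deliver
and QUEUE_SIZE are names I made up, and I assume skip == 0 and ignore seq id
wraparound. It shows that as long as message my_high_delivered + 1 is missing,
every call breaks on the first iteration and returns the same value:

```c
#include <assert.h>

#define QUEUE_SIZE 16u  /* hypothetical size, not the real sort queue size */

/* Toy stand-in for the sort queue: inuse[slot] != 0 means the message
 * whose seq id maps to that slot has been received. */
struct toy_queue {
    unsigned char inuse[QUEUE_SIZE];
};

/*
 * Mimics the quoted delivery loop with skip == 0: walk from
 * high_delivered + 1 up to end_point, stop at the first missing
 * message, and return the new high_delivered value.
 */
static unsigned int toy_deliver(const struct toy_queue *q,
                                unsigned int high_delivered,
                                unsigned int end_point)
{
    unsigned int stored = high_delivered;
    unsigned int range;
    unsigned int i;

    assert(end_point >= stored);
    range = end_point - stored;
    assert(range < QUEUE_SIZE);

    for (i = 1; i <= range; i++) {
        if (q->inuse[(stored + i) % QUEUE_SIZE] == 0) {
            break;                  /* hole: stop assembly */
        }
        high_delivered = stored + i;
    }
    return high_delivered;
}
```

With 1ff to 205 present and 1fe missing, toy_deliver(&q, 0x1fd, 0x205) keeps
returning 0x1fd until 1fe arrives, which looks like the repeated
"Delivering 1fd to 205" lines in the log.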
>
> The sort queue is a circular array which is cleared as
> sq_item_release is called. This should only occur after the message
> has been delivered to all nodes on the ring in
> totemsrp.c:messages_free.
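
If I understand this correctly, the behaviour is roughly like the toy sketch
below. This is my own simplified model, not the code from sq.h: toy_sq,
toy_add, toy_get, toy_release and TOY_SQ_SIZE are hypothetical names, the
payload is just an int, and seq id wraparound is ignored. Releasing up to a
seq id clears the in-use flags and moves the head, so a later lookup of a
released (or never received) seq id returns ENOENT:

```c
#include <assert.h>
#include <errno.h>

#define TOY_SQ_SIZE 8u  /* hypothetical; the real queue is much larger */

struct toy_sq {
    unsigned int head_seqid;            /* lowest seq id still held */
    unsigned char inuse[TOY_SQ_SIZE];   /* 1 if the slot holds a message */
    int payload[TOY_SQ_SIZE];
};

/* A seq id is in range if it fits in the window starting at head_seqid. */
static int toy_in_range(const struct toy_sq *sq, unsigned int seq_id)
{
    return seq_id >= sq->head_seqid &&
           seq_id - sq->head_seqid < TOY_SQ_SIZE;
}

/* Store a message; fails with ERANGE outside the current window. */
static int toy_add(struct toy_sq *sq, unsigned int seq_id, int value)
{
    if (!toy_in_range(sq, seq_id)) {
        return ERANGE;
    }
    sq->inuse[seq_id % TOY_SQ_SIZE] = 1;
    sq->payload[seq_id % TOY_SQ_SIZE] = value;
    return 0;
}

/* Look up a message; an empty slot (a hole) yields ENOENT. */
static int toy_get(const struct toy_sq *sq, unsigned int seq_id, int *out)
{
    if (sq->inuse[seq_id % TOY_SQ_SIZE] == 0) {
        return ENOENT;
    }
    *out = sq->payload[seq_id % TOY_SQ_SIZE];
    return 0;
}

/* Release everything up to and including seq_id and advance the head. */
static void toy_release(struct toy_sq *sq, unsigned int seq_id)
{
    unsigned int s;

    if (seq_id < sq->head_seqid) {
        return;                         /* already released */
    }
    assert(seq_id - sq->head_seqid < TOY_SQ_SIZE);
    for (s = sq->head_seqid; s <= seq_id; s++) {
        sq->inuse[s % TOY_SQ_SIZE] = 0;
    }
    sq->head_seqid = seq_id + 1;
}
```

In this model a slot only becomes empty through toy_release, so a zero flag
for a seq id above the head means the message was never received.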
>
> Regards
> -steve
>
--
Best regards,
Guangliang