Hi Steve,

Thanks for your help. I've tried out your patch and confirmed it fixes the problem.
Cheers,
Tim

On Fri, Jul 8, 2011 at 10:36 AM, Steven Dake <[email protected]> wrote:
> On 07/07/2011 03:07 PM, Tim Beale wrote:
>> Hi Steve,
>>
>> Thanks for your help. When we upgraded to v1.3.1 we picked up commit
>> 8603ff6e9a270ecec194f4e13780927ebeb9f5b2:
>>
>>>> totemsrp: free messages originated in recovery rather then rely on
>>>> messages_free
>>
>> which is why I was retesting this issue. But I still see the problem
>> even with the above change.
>>
>> The recovery code seems to work most of the time, but occasionally it
>> doesn't free all of the recovery messages on the queue. There are
>> recovery messages left with seq numbers higher than
>> instance->my_high_delivered / instance->my_aru.
>>
>> In the last crash I saw, there were 12 messages on the recovery queue
>> but only 5 of them got freed by the above patch. A node leave event
>> usually seems to occur at the same time.
>>
>
> I speculate there are gaps in the recovery queue. For example, my_aru
> = 5, but there are messages at 7 and 8, with 8 = my_high_seq_received,
> which results in data slots being taken up in the new message queue.
> What should really happen is that these last messages are delivered
> after a transitional configuration, to maintain SAFE agreement. We
> don't have support for SAFE atm, so it is probably safe just to throw
> these messages away.
>
> Could you test my speculative patch against your test case?
>
> Thanks!
> -steve
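To make the gap Steve describes concrete, here is a minimal standalone
sketch of that cleanup. The flat array and the names recovery_queue and
discard_above_aru are hypothetical simplifications for illustration; the
actual patch would operate on instance->recovery_sort_queue through the
sq_* API rather than anything this direct:

    #include <stdio.h>
    #include <stdlib.h>

    #define QUEUE_LEN 16

    struct recovery_msg {
            int seq;
            char *payload;
    };

    /* slots indexed by sequence number; NULL marks a gap */
    static struct recovery_msg *recovery_queue[QUEUE_LEN];

    /*
     * Free every message still queued above my_aru.  With my_aru = 5
     * and messages at 7 and 8 (my_high_seq_received = 8), seq 6 is a
     * gap, so 7 and 8 can never be delivered in agreed order and
     * would otherwise take up slots in the new regular queue.
     */
    static void discard_above_aru (int my_aru, int my_high_seq_received)
    {
            int seq;

            for (seq = my_aru + 1; seq <= my_high_seq_received; seq++) {
                    if (recovery_queue[seq] != NULL) {
                            free (recovery_queue[seq]->payload);
                            free (recovery_queue[seq]);
                            recovery_queue[seq] = NULL;
                    }
            }
    }

    int main (void)
    {
            int stragglers[] = { 7, 8 };
            size_t i;

            /* simulate the reported state: my_aru = 5, leftovers at 7, 8 */
            for (i = 0; i < sizeof (stragglers) / sizeof (stragglers[0]); i++) {
                    struct recovery_msg *msg = malloc (sizeof (*msg));

                    msg->seq = stragglers[i];
                    msg->payload = malloc (128);
                    recovery_queue[msg->seq] = msg;
            }

            discard_above_aru (5, 8);

            for (i = 0; i < QUEUE_LEN; i++) {
                    if (recovery_queue[i] != NULL) {
                            printf ("seq %zu still queued\n", i);
                    }
            }
            return (0);
    }

Run as-is it prints nothing, which is the point: without SAFE support
there is no transitional configuration in which messages past the gap
could still be legitimately delivered, so freeing them is the safe option.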
>> I can reproduce the problem reasonably reliably in a 2-node cluster with:
>>
>> #define TEST_DROP_ORF_TOKEN_PERCENTAGE 40
>> #define TEST_DROP_MCAST_PERCENTAGE 20
>>
>> But I suspect it's reliant on timing/messaging specific to my system.
>> Let me know if there's any debug or anything you want me to try out.
>>
>> Thanks,
>> Tim
>>
>> On Thu, Jul 7, 2011 at 3:47 PM, Steven Dake <[email protected]> wrote:
>>> On 07/06/2011 05:24 PM, Tim Beale wrote:
>>>> Hi,
>>>>
>>>> We've hit a problem in the recovery code and I'm struggling to
>>>> understand why we do the following:
>>>>
>>>>     /*
>>>>      * The recovery sort queue now becomes the regular
>>>>      * sort queue. It is necessary to copy the state
>>>>      * into the regular sort queue.
>>>>      */
>>>>     sq_copy (&instance->regular_sort_queue,
>>>>         &instance->recovery_sort_queue);
>>>>
>>>> The problem we're seeing is that sometimes an encapsulated message
>>>> from the recovery queue gets copied onto the regular queue, and
>>>> corosync then crashes trying to process it. (When it strips off the
>>>> totemsrp header it gets another totemsrp header rather than the
>>>> totempg header it expects.)
>>>>
>>>> The problem seems to happen when we only do the sq_items_release()
>>>> for a subset of the recovery messages, e.g. there are 12 messages
>>>> on the recovery queue and we only free/release 5 of them. The
>>>> remaining encapsulated recovery messages get left on the regular
>>>> queue and corosync crashes trying to deliver them.
>>>>
>>>> It looks to me like deliver_messages_from_recovery_to_regular()
>>>> handles the encapsulation correctly, stripping the extra header and
>>>> adding the recovery messages to the regular queue. But then the
>>>> sq_copy() just seems to overwrite the regular queue.
>>>>
>>>> We've avoided the crash in the past by just reinitialising both
>>>> queues, but I don't think this is the best solution.
>>>>
>>>
>>> I would expect that solution to lead to message loss or lockup of
>>> the protocol.
>>>
>>>> Any advice would be appreciated.
>>>>
>>>> Thanks,
>>>> Tim
>>>
>>> A proper fix should be in commit
>>> master: 7d5e588931e4393c06790995a995ea69e6724c54
>>> flatiron-1.3: 8603ff6e9a270ecec194f4e13780927ebeb9f5b2
>>>
>>> A new flatiron-1.3 release is in the works. There are other totem
>>> bug fixes you may wish to backport in the meantime.
>>>
>>> Let us know if that commit fixes the problem you encountered.
>>>
>>> Regards
>>> -steve
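As background to the crash discussed in the inner mail: during recovery
each original message is re-multicast wrapped in a second totemsrp
header, and delivery strips exactly one header before handing the rest
to totempg. The sketch below uses hypothetical simplified structs
(srp_header, pg_header, totempg_view), not the real corosync layouts,
to show why an encapsulated message copied straight across by sq_copy()
fails at delivery:

    #include <assert.h>
    #include <stdlib.h>

    /*
     * Hypothetical simplified framing, not the real corosync layouts.
     * A regular multicast is [srp_header][pg_header][data]; a recovery
     * re-send is encapsulated as [srp_header][srp_header][pg_header][data].
     */
    struct srp_header {
            int encapsulated;       /* non-zero for a recovery re-send */
    };

    struct pg_header {
            int msg_count;
    };

    /*
     * Delivery assumes exactly one srp_header sits in front of the
     * pg_header.  deliver_messages_from_recovery_to_regular() makes
     * that true by stripping the outer header as it requeues each
     * message; a raw sq_copy() of the recovery queue does not, so the
     * "pg_header" found here is really a second srp_header.
     */
    static struct pg_header *totempg_view (char *msg)
    {
            struct srp_header *srp = (struct srp_header *) msg;

            assert (srp->encapsulated == 0);
            return ((struct pg_header *) (msg + sizeof (struct srp_header)));
    }

    int main (void)
    {
            char *msg = malloc (sizeof (struct srp_header) +
                sizeof (struct pg_header));
            struct srp_header *srp = (struct srp_header *) msg;

            srp->encapsulated = 0;          /* properly stripped: fine */
            (void) totempg_view (msg);

            srp->encapsulated = 1;          /* sq_copy() leftover: aborts */
            (void) totempg_view (msg);

            free (msg);
            return (0);
    }

The reinit-both-queues workaround avoids this failure by emptying the
queues entirely, but as Steve notes that trades the crash for possible
message loss or protocol lockup, which is why the commits above free the
messages originated in recovery instead.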
