Hi Steve,

Thanks for your help. When we upgraded to v1.3.1 we picked up commit
8603ff6e9a270ecec194f4e13780927ebeb9f5b2:
>> totemsrp: free messages originated in recovery rather then rely on
>> messages_free

Which is why I was retesting this issue. But I still see the problem even with
the above change.

The recovery code seems to work most of the time. But occasionally it doesn't
free all of the recovery messages on the queue. It seems there are recovery
messages left with seq numbers higher than
instance->my_high_delivered/instance->my_aru.

In the last crash I saw, there were 12 messages on the recovery queue but only
5 of them got freed by the above patch/code. A node leave event usually seems
to occur at the same time.

I can reproduce the problem reasonably reliably in a 2-node cluster with:
#define TEST_DROP_ORF_TOKEN_PERCENTAGE 40
#define TEST_DROP_MCAST_PERCENTAGE 20
But I suspect it's reliant on timing/messaging specific to my system.

Let me know if there's any debug or anything you want me to try out.

Thanks,
Tim

On Thu, Jul 7, 2011 at 3:47 PM, Steven Dake <[email protected]> wrote:
> On 07/06/2011 05:24 PM, Tim Beale wrote:
>> Hi,
>>
>> We've hit a problem in the recovery code and I'm struggling to understand why
>> we do the following:
>>
>> /*
>>  * The recovery sort queue now becomes the regular
>>  * sort queue. It is necessary to copy the state
>>  * into the regular sort queue.
>>  */
>> sq_copy (&instance->regular_sort_queue,
>>     &instance->recovery_sort_queue);
>>
>> The problem we're seeing is sometimes we get an encapsulated message from the
>> recovery queue copied onto the regular queue, and corosync then crashes trying
>> to process the message. (When it strips off the totemsrp header it gets another
>> totemsrp header rather than the totempg header it expects).
>>
>> The problem seems to happen when we only do the sq_items_release() for a subset
>> of the recovery messages, e.g. there are 12 messages on the recovery queue and
>> we only free/release 5 of them. The remaining encapsulated recovery messages
>> get left on the regular queue and corosync crashes trying to deliver them.
>>
>> It looks to me like deliver_messages_from_recovery_to_regular() handles the
>> encapsulation correctly, stripping the extra header and adding the recovery
>> messages to the regular queue. But then the sq_copy() just seems to overwrite
>> the regular queue.
>>
>> We've avoided the crash in the past by just reiniting both queues, but I don't
>> think this is the best solution.
>>
>
> I would expect this solution would lead to message loss or lockup of the
> protocol.
>
>> Any advice would be appreciated.
>>
>> Thanks,
>> Tim
>
> A proper fix should be in commit
> master:
> 7d5e588931e4393c06790995a995ea69e6724c54
> flatiron-1.3:
> 8603ff6e9a270ecec194f4e13780927ebeb9f5b2
>
> A new flatiron-1.3 release is in the works. There are other totem bugs
> you may wish to backport in the meantime.
>
> Let us know if that commit fixes the problem you encountered.
>
> Regards
> -steve
>
>> _______________________________________________
>> Openais mailing list
>> [email protected]
>> https://lists.linux-foundation.org/mailman/listinfo/openais
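
To make the symptom in this thread concrete, here is a toy model in C. It is not corosync code: the toy_* types and functions are hypothetical names invented for illustration, and the queue is a plain fixed-size array rather than the real sort queue. It only models what is reported above (12 recovery messages, a release that covers just the first 5 sequence numbers, then a wholesale copy onto the regular queue). It says nothing about why the release stops short.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_QUEUE_SIZE 16

/* Toy stand-in for one sort queue slot; NOT the corosync sq.h layout. */
struct toy_item {
    bool in_use;        /* slot currently holds a message */
    bool encapsulated;  /* message still carries the extra recovery header */
};

struct toy_queue {
    struct toy_item items[TOY_QUEUE_SIZE];
};

/* Store an encapsulated recovery message at sequence number seq. */
void toy_add_recovery_msg(struct toy_queue *q, unsigned int seq)
{
    assert(seq < TOY_QUEUE_SIZE);
    q->items[seq].in_use = true;
    q->items[seq].encapsulated = true;
}

/* Release every slot whose sequence number is <= up_to (inclusive). */
void toy_release_up_to(struct toy_queue *q, unsigned int up_to)
{
    for (unsigned int seq = 0; seq < TOY_QUEUE_SIZE && seq <= up_to; seq++) {
        q->items[seq].in_use = false;
        q->items[seq].encapsulated = false;
    }
}

/* Wholesale copy: whatever is still in src ends up in dst. */
void toy_copy(struct toy_queue *dst, const struct toy_queue *src)
{
    *dst = *src;
}

/* Count messages that would be delivered with the extra header still on. */
unsigned int toy_count_encapsulated(const struct toy_queue *q)
{
    unsigned int n = 0;
    for (unsigned int seq = 0; seq < TOY_QUEUE_SIZE; seq++) {
        if (q->items[seq].in_use && q->items[seq].encapsulated) {
            n++;
        }
    }
    return n;
}

/* Scenario from the report: seqs 1..12 queued, release only covers 1..5. */
unsigned int toy_scenario(void)
{
    struct toy_queue recovery = {0};
    struct toy_queue regular = {0};

    for (unsigned int seq = 1; seq <= 12; seq++) {
        toy_add_recovery_msg(&recovery, seq);
    }
    toy_release_up_to(&recovery, 5);   /* bound chosen to match "5 freed" */
    toy_copy(&regular, &recovery);     /* leftovers land on the regular queue */

    return toy_count_encapsulated(&regular);
}
```

In this model toy_scenario() returns 7: the messages with sequence numbers 6 to 12 survive the release and are copied to the regular queue still encapsulated, which matches the kind of leftover described in the report.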
