Hi Steve,

Thanks for your help. I've tried out your patch and confirmed it fixes the problem.
Cheers,
Tim

On Fri, Jul 8, 2011 at 10:36 AM, Steven Dake <[email protected]> wrote:
> On 07/07/2011 03:07 PM, Tim Beale wrote:
>> Hi Steve,
>>
>> Thanks for your help. When we upgraded to v1.3.1 we picked up commit
>> 8603ff6e9a270ecec194f4e13780927ebeb9f5b2:
>>
>>>> totemsrp: free messages originated in recovery rather then rely on
>>>> messages_free
>>
>> which is why I was retesting this issue. But I still see the problem
>> even with the above change.
>>
>> The recovery code seems to work most of the time, but occasionally it
>> doesn't free all of the recovery messages on the queue. There are
>> recovery messages left with seq numbers higher than
>> instance->my_high_delivered / instance->my_aru.
>>
>> In the last crash I saw, there were 12 messages on the recovery queue
>> but only 5 of them got freed by the above patch. A node leave event
>> usually seems to occur at the same time.
>>
>
> I speculate there are gaps in the recovery queue. For example, my_aru
> = 5, but there are messages at 7 and 8, with 8 = my_high_seq_received,
> which results in data slots being taken up in the new message queue.
> What should really happen is that these last messages are delivered
> after a transitional configuration, to maintain SAFE agreement. We
> don't have support for SAFE atm, so it is probably safe just to throw
> these messages away.
>
> Could you test my speculative patch against your test case?
>
> Thanks!
> -steve
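To make the gap Steve describes concrete, here is a minimal standalone
sketch of that cleanup. The flat array and the names recovery_queue and
discard_above_aru are hypothetical simplifications for illustration; the
actual patch would operate on instance->recovery_sort_queue through the
sq_* API rather than anything this direct:

    #include <stdio.h>
    #include <stdlib.h>

    #define QUEUE_LEN 16

    struct recovery_msg {
            int seq;
            char *payload;
    };

    /* slots indexed by sequence number; NULL marks a gap */
    static struct recovery_msg *recovery_queue[QUEUE_LEN];

    /*
     * Free every message still queued above my_aru.  With my_aru = 5
     * and messages at 7 and 8 (my_high_seq_received = 8), seq 6 is a
     * gap, so 7 and 8 can never be delivered in agreed order and
     * would otherwise take up slots in the new regular queue.
     */
    static void discard_above_aru (int my_aru, int my_high_seq_received)
    {
            int seq;

            for (seq = my_aru + 1; seq <= my_high_seq_received; seq++) {
                    if (recovery_queue[seq] != NULL) {
                            free (recovery_queue[seq]->payload);
                            free (recovery_queue[seq]);
                            recovery_queue[seq] = NULL;
                    }
            }
    }

    int main (void)
    {
            int stragglers[] = { 7, 8 };
            size_t i;

            /* simulate the reported state: my_aru = 5, leftovers at 7, 8 */
            for (i = 0; i < sizeof (stragglers) / sizeof (stragglers[0]); i++) {
                    struct recovery_msg *msg = malloc (sizeof (*msg));

                    msg->seq = stragglers[i];
                    msg->payload = malloc (128);
                    recovery_queue[msg->seq] = msg;
            }

            discard_above_aru (5, 8);

            for (i = 0; i < QUEUE_LEN; i++) {
                    if (recovery_queue[i] != NULL) {
                            printf ("seq %zu still queued\n", i);
                    }
            }
            return (0);
    }

Run as-is it prints nothing, which is the point: without SAFE support
there is no transitional configuration in which messages past the gap
could still be legitimately delivered, so freeing them is the safe option.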
>> I can reproduce the problem reasonably reliably in a 2-node cluster with:
>>
>> #define TEST_DROP_ORF_TOKEN_PERCENTAGE 40
>> #define TEST_DROP_MCAST_PERCENTAGE 20
>>
>> But I suspect it's reliant on timing/messaging specific to my system.
>> Let me know if there's any debug or anything you want me to try out.
>>
>> Thanks,
>> Tim
>>
>> On Thu, Jul 7, 2011 at 3:47 PM, Steven Dake <[email protected]> wrote:
>>> On 07/06/2011 05:24 PM, Tim Beale wrote:
>>>> Hi,
>>>>
>>>> We've hit a problem in the recovery code and I'm struggling to
>>>> understand why we do the following:
>>>>
>>>>     /*
>>>>      * The recovery sort queue now becomes the regular
>>>>      * sort queue. It is necessary to copy the state
>>>>      * into the regular sort queue.
>>>>      */
>>>>     sq_copy (&instance->regular_sort_queue,
>>>>         &instance->recovery_sort_queue);
>>>>
>>>> The problem we're seeing is that sometimes an encapsulated message
>>>> from the recovery queue gets copied onto the regular queue, and
>>>> corosync then crashes trying to process it. (When it strips off the
>>>> totemsrp header it gets another totemsrp header rather than the
>>>> totempg header it expects.)
>>>>
>>>> The problem seems to happen when we only do the sq_items_release()
>>>> for a subset of the recovery messages, e.g. there are 12 messages
>>>> on the recovery queue and we only free/release 5 of them. The
>>>> remaining encapsulated recovery messages get left on the regular
>>>> queue and corosync crashes trying to deliver them.
>>>>
>>>> It looks to me like deliver_messages_from_recovery_to_regular()
>>>> handles the encapsulation correctly, stripping the extra header and
>>>> adding the recovery messages to the regular queue. But then the
>>>> sq_copy() just seems to overwrite the regular queue.
>>>>
>>>> We've avoided the crash in the past by just reinitialising both
>>>> queues, but I don't think this is the best solution.
>>>>
>>>
>>> I would expect that solution to lead to message loss or lockup of
>>> the protocol.
>>>
>>>> Any advice would be appreciated.
>>>>
>>>> Thanks,
>>>> Tim
>>>
>>> A proper fix should be in commit
>>> master: 7d5e588931e4393c06790995a995ea69e6724c54
>>> flatiron-1.3: 8603ff6e9a270ecec194f4e13780927ebeb9f5b2
>>>
>>> A new flatiron-1.3 release is in the works. There are other totem
>>> bug fixes you may wish to backport in the meantime.
>>>
>>> Let us know if that commit fixes the problem you encountered.
>>>
>>> Regards
>>> -steve
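As background to the crash discussed in the inner mail: during recovery
each original message is re-multicast wrapped in a second totemsrp
header, and delivery strips exactly one header before handing the rest
to totempg. The sketch below uses hypothetical simplified structs
(srp_header, pg_header, totempg_view), not the real corosync layouts,
to show why an encapsulated message copied straight across by sq_copy()
fails at delivery:

    #include <assert.h>
    #include <stdlib.h>

    /*
     * Hypothetical simplified framing, not the real corosync layouts.
     * A regular multicast is [srp_header][pg_header][data]; a recovery
     * re-send is encapsulated as [srp_header][srp_header][pg_header][data].
     */
    struct srp_header {
            int encapsulated;       /* non-zero for a recovery re-send */
    };

    struct pg_header {
            int msg_count;
    };

    /*
     * Delivery assumes exactly one srp_header sits in front of the
     * pg_header.  deliver_messages_from_recovery_to_regular() makes
     * that true by stripping the outer header as it requeues each
     * message; a raw sq_copy() of the recovery queue does not, so the
     * "pg_header" found here is really a second srp_header.
     */
    static struct pg_header *totempg_view (char *msg)
    {
            struct srp_header *srp = (struct srp_header *) msg;

            assert (srp->encapsulated == 0);
            return ((struct pg_header *) (msg + sizeof (struct srp_header)));
    }

    int main (void)
    {
            char *msg = malloc (sizeof (struct srp_header) +
                sizeof (struct pg_header));
            struct srp_header *srp = (struct srp_header *) msg;

            srp->encapsulated = 0;          /* properly stripped: fine */
            (void) totempg_view (msg);

            srp->encapsulated = 1;          /* sq_copy() leftover: aborts */
            (void) totempg_view (msg);

            free (msg);
            return (0);
    }

The reinit-both-queues workaround avoids this failure by emptying the
queues entirely, but as Steve notes that trades the crash for possible
message loss or protocol lockup, which is why the commits above free the
messages originated in recovery instead.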
