Hi,

We've hit a problem in the recovery code and I'm struggling to understand why
we do the following:

        /*
         * The recovery sort queue now becomes the regular
         * sort queue.  It is necessary to copy the state
         * into the regular sort queue.
         */
        sq_copy (&instance->regular_sort_queue, &instance->recovery_sort_queue);

The problem we're seeing is that an encapsulated message from the recovery
queue sometimes gets copied onto the regular queue, and corosync then crashes
trying to process it. (When it strips off the totemsrp header, it finds another
totemsrp header rather than the totempg header it expects.)

The problem seems to happen when we only do the sq_items_release() for a subset
of the recovery messages, e.g. there are 12 messages on the recovery queue and
we only free/release 5 of them. The remaining encapsulated recovery messages
get left on the regular queue and corosync crashes trying to deliver them.

It looks to me like deliver_messages_from_recovery_to_regular() handles the
encapsulation correctly, stripping the extra header and adding the recovery
messages to the regular queue. But then the sq_copy() just seems to overwrite
the regular queue.
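
To make the suspected sequence concrete, here is a toy model in C. It is not
corosync code: the toy_* names, the struct layout and the fixed-size array are
all made up for illustration, and it only replays the sequence as I understand
it from the description above (unwrap into the regular queue, release a subset
of the recovery queue, then copy the recovery queue over the regular one).

```c
#include <assert.h>
#include <string.h>

#define TOY_QUEUE_SIZE 16
#define TOY_MSG_COUNT  12   /* messages on the recovery queue, as in the report */
#define TOY_RELEASED    5   /* how many of them actually get released */

/* Hypothetical queue slot; not corosync's real message layout. */
struct toy_item {
    int in_use;        /* slot currently holds a message */
    int encapsulated;  /* 1 = still wrapped in an extra outer header */
    unsigned int seq;  /* sequence number */
};

struct toy_queue {
    struct toy_item items[TOY_QUEUE_SIZE];
};

/* Wholesale copy: everything in dst is replaced by the contents of src. */
static void toy_queue_copy(struct toy_queue *dst, const struct toy_queue *src)
{
    memcpy(dst, src, sizeof(*dst));
}

/* Release every item with a sequence number up to and including seq. */
static void toy_queue_release_upto(struct toy_queue *q, unsigned int seq)
{
    for (int i = 0; i < TOY_QUEUE_SIZE; i++) {
        if (q->items[i].in_use && q->items[i].seq <= seq) {
            q->items[i].in_use = 0;
        }
    }
}

/* Count live items that still carry the extra outer header. */
static int toy_count_encapsulated(const struct toy_queue *q)
{
    int count = 0;

    for (int i = 0; i < TOY_QUEUE_SIZE; i++) {
        if (q->items[i].in_use && q->items[i].encapsulated) {
            count++;
        }
    }
    return count;
}

/*
 * Replays the suspected sequence and returns how many encapsulated
 * messages end up on the regular queue.
 */
int toy_demo(void)
{
    struct toy_queue regular;
    struct toy_queue recovery;

    assert(TOY_MSG_COUNT <= TOY_QUEUE_SIZE);
    memset(&regular, 0, sizeof(regular));
    memset(&recovery, 0, sizeof(recovery));

    /* The recovery queue starts with 12 encapsulated messages. */
    for (int i = 0; i < TOY_MSG_COUNT; i++) {
        recovery.items[i].in_use = 1;
        recovery.items[i].encapsulated = 1;
        recovery.items[i].seq = (unsigned int)i + 1;
    }

    /* Step 1: unwrap each recovery message onto the regular queue. */
    for (int i = 0; i < TOY_MSG_COUNT; i++) {
        regular.items[i] = recovery.items[i];
        regular.items[i].encapsulated = 0;
    }

    /* Step 2: only a subset of the recovery queue is released. */
    toy_queue_release_upto(&recovery, TOY_RELEASED);

    /* Step 3: the wholesale copy replaces the unwrapped messages. */
    toy_queue_copy(&regular, &recovery);

    return toy_count_encapsulated(&regular);
}
```

With these numbers toy_demo() returns 7: the seven unreleased messages land on
the regular queue still encapsulated, which matches what we see before the
crash. If all 12 were released before the copy, it would return 0.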

We've avoided the crash in the past by just reinitializing both queues, but I
don't think this is the best solution.

Any advice would be appreciated.

Thanks,
Tim
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
