I just put together a careful analysis of the logs at SC-2-1 and SC-2-2
showing that this was a case of false positive. That is, the discrepancy
detected by finalize-sync was not caused by any error. Rather, the implementer
with the name TA-Tgen-PL-E1-safTg14-80 connected under id 3438, then
disconnected, then connected under id 3452, all during a sync, with the
two final operations (disconnect and connect) occurring during the same
second as the finalizeSync message arrived.

When I posted this, our wonderful new ticket tool decided that it should be
discarded instead of posted.
I don't have time to cut and paste all those log entries again, so you
will have to trust me on that.
In any case, the theory matches what is in the source code. If the above
scenario happened such that the finalizeSync was generated when the implementer
had id 3438, but that message arrived over fevs after the disconnect and
reconnect, then the false positive error would be flagged.


---

** [tickets:#551] IMM:  Immnd coredump in finalizeSync at veteran nodes due to 
mismatched implementer**

**Status:** assigned
**Created:** Thu Aug 22, 2013 10:00 AM UTC by Anders Bjornerstedt
**Last Updated:** Thu Aug 22, 2013 10:00 AM UTC
**Owner:** Anders Bjornerstedt

This is similar to #2963 (http://devel.opensaf.org/ticket/2963)
and #2918 (http://devel.opensaf.org/ticket/2918), both of which were
problems resulting from enhancement #1871 (http://devel.opensaf.org/ticket/1871),
delivered to OpenSAF 4.3 for allowing saImmOiImplementerSet during sync.

There is again a case of apparent mismatch of implementer-id in finalizeSync at
veteran IMMNDs. Veteran IMMNDs (IMMNDs that are already up and synced)
opportunistically use part of the sync for verifying that they agree on cluster
resource state with the coordinator IMMND. In this case there is a mismatch on
implementer-id for a certain implementer-name:

    Aug 17 10:48:25 SC-2-1 osafimmnd[15998]: ER Sync-verify: Established 
    node has different Implementer-id: 3452 for name: TA-Tgen-PL-E1-safTg14-80,
    sync says 3438. 

The problem is rare and hard to reproduce. Such a failure can be due to
a real mismatch (true positive) caused by some true inconsistency. But it may
also be due to a false mismatch (false positive) caused by the way the
finalizeSync message is generated at coord. The finalizeSync message is
generated by the coord at a point in the fevs sequence that is earlier than the
point in the fevs sequence where the nodes (including the coord) receive the
finalizeSync message over fevs.

For operations that are allowed during sync and that affect some state of the
immnd, that state may then change in the part of the fevs sequence that falls
between the coord generating the message and all nodes receiving it.
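
The window described above can be illustrated with a small simulation. This is
only a sketch under stated assumptions, not the IMMND code: the `Node` class,
the message tuples and the `run` function are hypothetical names made up for
the illustration, and fevs is modelled as a plain ordered list. The implementer
name and ids are taken from the log line above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an IMMND's implementer table (name -> id)."""
    implementers: dict = field(default_factory=dict)

    def apply(self, msg):
        """Apply one message in fevs order; return any sync-verify mismatches."""
        kind = msg[0]
        if kind == "implementer_set":
            _, name, impl_id = msg
            self.implementers[name] = impl_id
        elif kind == "implementer_clear":
            _, name = msg
            self.implementers.pop(name, None)
        elif kind == "finalize_sync":
            _, snapshot = msg
            return self.verify(snapshot)
        return []

    def verify(self, snapshot):
        """Compare local state against the state carried in finalizeSync."""
        mismatches = []
        for name, synced_id in snapshot.items():
            local_id = self.implementers.get(name)
            if local_id is not None and local_id != synced_id:
                mismatches.append((name, local_id, synced_id))
        return mismatches


def run():
    name = "TA-Tgen-PL-E1-safTg14-80"
    coord, veteran = Node(), Node()

    # The implementer connects under id 3438; every node sees this.
    first = ("implementer_set", name, 3438)
    coord.apply(first)
    veteran.apply(first)

    # The coord builds finalizeSync from its state at this point...
    finalize = ("finalize_sync", dict(coord.implementers))

    # ...but the disconnect and reconnect are ordered before it in fevs.
    fevs = [
        ("implementer_clear", name),
        ("implementer_set", name, 3452),
        finalize,
    ]
    result = []
    for msg in fevs:
        coord.apply(msg)
        result = veteran.apply(msg)
    return result


if __name__ == "__main__":
    for name, local_id, synced_id in run():
        print(f"different Implementer-id: {local_id} for name: {name}, "
              f"sync says {synced_id}")
```

In this model both nodes hold exactly the same state when finalizeSync
arrives, yet the comparison still reports 3452 against 3438, because the
message carries the state from the moment it was generated.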




