The fix for this enhancement needs to review the shared epoch solution
done in:

    https://sourceforge.net/p/opensaf/tickets/556/




---

** [tickets:#31] IMM: Allow imm-sync to start when there are critical ccbs 
still active**

**Status:** unassigned
**Created:** Tue May 07, 2013 11:01 AM UTC by Anders Bjornerstedt
**Last Updated:** Tue May 07, 2013 11:01 AM UTC
**Owner:** Anders Bjornerstedt

Migrated from:
http://devel.opensaf.org/ticket/3005
--------------------------------------
It has been discovered that poor performance of the shared file
system can in some cases prevent imm-sync from making progress.
This will be the case if:

1) Problems with the FS result in the PBE being "stuck".
2) The backlog includes at least one ccb, which will then be
   in the critical state (commit delegated to PBE/SQLite).
3) There is a boot/reboot of some node in the cluster, triggering
   a sync.

In principle imm-sync should be totally independent of the PBE and
the file system. But the sync will not start until there are no
active CCBs. Non-critical CCBs are given a grace period and
after that are aborted by the imm server. But CCBs that are
in the critical phase cannot be aborted by the imm server, since
this could result in a divergence between the ram state and the
file state. So the sync will be blocked by any such critical ccbs
that are backlogged.
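
The gating rule described above can be sketched as follows. This is only
an illustration, not the immsv code: the names `Ccb`, `CcbState`,
`evaluate_sync_gate`, `waited_seconds` and `grace_seconds` are all
hypothetical, and the real server tracks far more state.

```python
from dataclasses import dataclass
from enum import Enum, auto


class CcbState(Enum):
    ACTIVE = auto()    # open, commit not yet delegated to the PBE
    CRITICAL = auto()  # commit delegated to PBE/SQLite, outcome unknown


@dataclass
class Ccb:
    ccb_id: int
    state: CcbState


def evaluate_sync_gate(ccbs, waited_seconds, grace_seconds):
    """Hypothetical model of the current rule.

    Returns (can_start, to_abort, blockers):
      can_start - True if sync may start once the CCBs in to_abort are aborted
      to_abort  - non-critical CCBs whose grace period has expired
      blockers  - CCBs that still hold the sync back
    """
    to_abort = []
    blockers = []
    for ccb in ccbs:
        if ccb.state is CcbState.CRITICAL:
            # Aborting here could make ram state and file state diverge.
            blockers.append(ccb.ccb_id)
        elif waited_seconds >= grace_seconds:
            to_abort.append(ccb.ccb_id)
        else:
            # Still within the grace period, so the sync keeps waiting.
            blockers.append(ccb.ccb_id)
    return (not blockers), to_abort, blockers
```

With a stuck PBE, a critical ccb stays in `blockers` no matter how long
the sync waits, which is the problem this ticket is about.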

There are two possible solutions:

1) The "simplest" is for the immsv to brutally remove the imm.db
file and abort the ccbs that were in critical. This has the
serious disadvantage that the imm.db file needs to be regenerated
(towards a file system that is not performing well) and until that
has been done, any cluster restart will escalate to a super outage,
i.e. a restore, not just a reload of the cluster. The likelihood
of a cluster restart is also elevated in this situation because
typically the node that needs to be synced is an SC, which means
the cluster is in a degraded 1-safe state. The file system problems
also tend to raise the risk of a cluster restart caused
by other components dependent on the file system.
So I don't think this is the right kind of solution.

2) Change the implementation of sync to cope with critical ccbs.
This will complicate the sync protocol, since the active ccb
in essence needs to be set up at the sync client node by the sync
protocol. The commit of the ccb may also arrive at any time.
It can arrive before, during or after the messages that set up the
ccb at the sync client. The commit may also arrive long
after the entire sync is completed. And there may be additional
subsequent syncs that have to deal with indefinitely open CCBs
in critical. All ccbs in critical can be resolved by restarting
the PBE. But a restart of the PBE is dependent on the file system.
So that is not a solution for the problem focused on here.
The assumption has to be that the file system may be "eternally"
unavailable.
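
To illustrate the ordering problem in solution 2, here is a minimal
sketch of how a sync client might tolerate the commit arriving before or
after the ccb has been set up by the sync. It is an assumption-laden
model, not a proposed design: the class `SyncClientCcbTracker`, its
methods and the `"commit"` outcome string are hypothetical, and the
setup is simplified to a single message per ccb.

```python
class SyncClientCcbTracker:
    """Hypothetical bookkeeping for critical ccbs at a sync client."""

    def __init__(self):
        self.pending_ops = {}     # ccb_id -> ops set up by sync, outcome unknown
        self.early_outcomes = {}  # ccb_id -> outcome that arrived before setup
        self.resolved = {}        # ccb_id -> (outcome, ops)

    def on_ccb_setup(self, ccb_id, ops):
        """The sync protocol delivers the contents of a critical ccb."""
        outcome = self.early_outcomes.pop(ccb_id, None)
        if outcome is not None:
            # The outcome overtook the setup messages: resolve right away.
            self.resolved[ccb_id] = (outcome, list(ops))
        else:
            self.pending_ops[ccb_id] = list(ops)

    def on_outcome(self, ccb_id, outcome):
        """The outcome of a critical ccb arrives, possibly long after sync."""
        if ccb_id in self.resolved:
            return  # duplicate, already handled
        if ccb_id in self.pending_ops:
            self.resolved[ccb_id] = (outcome, self.pending_ops.pop(ccb_id))
        else:
            # Setup not seen yet: remember the outcome until it arrives.
            self.early_outcomes[ccb_id] = outcome

    def open_critical(self):
        """Ccbs still in critical; a later sync would have to carry these."""
        return sorted(self.pending_ops)
```

Anything left in `open_critical()` when the sync completes corresponds
to the indefinitely open CCBs that subsequent syncs must also handle.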


