The proposed solution above involves an elaboration of the sync protocol.

A simpler, and thus safer, alternative would be for the PBE
to probe the file system for writability before building up the sqlite
transaction, and in particular before committing it.

The probe would be done in a separate thread (to avoid blocking the
main thread), or the main thread could use the asynchronous file system API.
The probe could, for example, write the CCB-id of the next ccb to be
committed to a small separate file. If this write times out, then the
probe has failed. A failed probe would result in the PBE aborting the
currently critical CCB, instead of attempting to commit it via sqlite.
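
A minimal sketch of such a probe, in Python for brevity rather than in the language of the actual PBE. The names `probe_writable` and `commit_or_abort`, the probe file path, and the timeout value are hypothetical; the `commit` and `abort` callbacks stand in for whatever the PBE really does.

```python
import os
import threading


def probe_writable(probe_path, ccb_id, timeout_s=2.0):
    """Return True if ccb_id was written and fsynced to probe_path in time.

    Hypothetical helper: the write runs in a separate thread so that a
    blocked file system cannot hang the caller.
    """
    done = threading.Event()
    result = {"ok": False}

    def _write():
        try:
            fd = os.open(probe_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            try:
                os.write(fd, f"{ccb_id}\n".encode("ascii"))
                os.fsync(fd)  # force the data out to the file system
            finally:
                os.close(fd)
            result["ok"] = True
        except OSError:
            pass  # a write error also counts as a failed probe
        finally:
            done.set()

    # Daemon thread: a write that never returns must not block process exit.
    threading.Thread(target=_write, daemon=True).start()
    finished = done.wait(timeout_s)  # False means the write timed out
    return finished and result["ok"]


def commit_or_abort(probe_path, ccb_id, commit, abort, timeout_s=2.0):
    """Commit the critical ccb only if the probe succeeds, else abort it."""
    if probe_writable(probe_path, ccb_id, timeout_s):
        commit(ccb_id)  # the commit itself may still block, see below
        return "committed"
    abort(ccb_id)
    return "aborted"
```

Note that a probe thread stuck in a blocked write is simply left behind here; a real implementation would have to decide how to reap or reuse it.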

This solution will reduce the risk of the PBE hanging on a blocked
file system, but it will not eliminate it.
If the probe succeeds, the PBE will start the commit of the sqlite
transaction, and if the file system *then* gets blocked during the sqlite
commit, the PBE will hang until the file system becomes available again.



---

**[tickets:#31] IMM: Allow imm-sync to start when there are critical ccbs still active**

**Status:** unassigned
**Created:** Tue May 07, 2013 11:01 AM UTC by Anders Bjornerstedt
**Last Updated:** Wed Aug 28, 2013 09:35 AM UTC
**Owner:** Anders Bjornerstedt

Migrated from:
http://devel.opensaf.org/ticket/3005
--------------------------------------
It has been discovered that problems with the performance of the
shared file system can in some cases prevent imm-sync from making
progress. This will be the case if:

1) Problems with the FS result in the PBE being "stuck".
2) The backlog includes at least one ccb, which will then be
   in the critical state (commit delegated to PBE/SQLite).
3) There is a boot/reboot of some node in the cluster, triggering
   a sync.

In principle imm-sync should be totally independent of the PBE and
the file system. But the sync will not start until there are no
active CCBs. Non-critical CCBs are given a period of grace and
after that are aborted by the imm server. But CCBs that are
in the critical phase cannot be aborted by the imm server, since
this could result in a divergence between the ram state and the
file state. So the sync will be blocked by any such critical ccbs
that are backlogged.
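
The gating rule described above can be sketched as follows. This is an illustrative model only, not the immsv code: `Ccb`, `CcbState`, `SyncDecision`, `sync_gate` and the notion of a single `grace_s` value measured from the sync request are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, NamedTuple


class CcbState(Enum):
    ACTIVE = "active"      # non-critical: may be aborted by the imm server
    CRITICAL = "critical"  # commit delegated to PBE/SQLite: must not be aborted


@dataclass
class Ccb:
    ccb_id: int
    state: CcbState


class SyncDecision(NamedTuple):
    can_start: bool        # True if sync may start once to_abort is aborted
    to_abort: List[int]    # non-critical ccbs whose grace period has expired
    blocked_by: List[int]  # critical ccbs that keep the sync from starting


def sync_gate(ccbs, waited_s, grace_s):
    """Decide whether a requested sync may start (hypothetical model)."""
    critical = [c.ccb_id for c in ccbs if c.state is CcbState.CRITICAL]
    non_critical = [c.ccb_id for c in ccbs if c.state is CcbState.ACTIVE]
    grace_over = waited_s >= grace_s
    to_abort = non_critical if grace_over else []
    # Critical ccbs always block; non-critical ones block only during grace.
    can_start = not critical and (grace_over or not non_critical)
    return SyncDecision(can_start, to_abort, critical)
```

With one non-critical and one critical ccb, the non-critical one ends up in `to_abort` once the grace period has passed, but `can_start` stays false for as long as the critical ccb remains, which is the blocking behaviour this ticket is about.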

There are two possible solutions:

1) The "simplest" is for the immsv to brutally remove the imm.db
file and abort the ccbs that were in critical. This has the
serious disadvantage that the imm.db file needs to be regenerated
(towards a file system that is not performing well) and until that
has been done, any cluster restart will escalate to a super outage,
i.e. a restore, not just a reload of the cluster. The likelihood
of a cluster restart is also elevated in this situation because
typically the node that needs to be synced is an SC, which means
the cluster is in a degraded 1-safe state. The file system problems
also have a tendency to raise the risk of a cluster restart caused
by other components dependent on the file system.
So I don't think this is the right kind of solution.

2) Change the implementation of sync to cope with critical ccbs.
This will complicate the sync protocol, since the active ccb
in essence needs to be set up at the sync client node, by the sync
protocol. The commit of the ccb may also arrive at any time.
It can arrive before, during, or after the messages that set up the
ccb at the sync client arrive. The commit may also arrive long
after the entire sync is completed. And there may be additional
subsequent syncs that have to deal with indefinitely open CCBs
in critical. All ccbs in critical can be resolved by restarting
the PBE. But a restart of the PBE is dependent on the file system.
So that is not a solution for the problem focused on here.
The assumption has to be that the file system may be "eternally"
unavailable.
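
To illustrate the ordering problem in 2), here is a rough sketch of how a sync client might cope with a commit that arrives before or after the ccb has been set up. It is not a proposal for the actual protocol: the class `SyncClientCcbTable` and its methods are hypothetical, ccb setup is simplified to a single message (so the "during" case is not modelled), and an abort outcome is left out.

```python
class SyncClientCcbTable:
    """Hypothetical bookkeeping for critical ccbs at a sync client."""

    def __init__(self):
        self.pending = {}           # ccb_id -> ops set up via sync, awaiting commit
        self.early_commits = set()  # commits that arrived before the setup
        self.applied = []           # (ccb_id, ops) in the order they were applied

    def on_ccb_setup(self, ccb_id, ops):
        """The sync protocol delivers the contents of a critical ccb."""
        if ccb_id in self.early_commits:
            # The commit overtook the setup: apply immediately.
            self.early_commits.discard(ccb_id)
            self.applied.append((ccb_id, list(ops)))
        else:
            self.pending[ccb_id] = list(ops)

    def on_commit(self, ccb_id):
        """The commit arrives, possibly long after the sync has completed."""
        ops = self.pending.pop(ccb_id, None)
        if ops is None:
            self.early_commits.add(ccb_id)  # remember it until the setup arrives
        else:
            self.applied.append((ccb_id, ops))

    def open_critical(self):
        """Ccbs still open in critical; a later sync must carry these too."""
        return sorted(self.pending)
```

Even in this reduced form, the table has to survive the end of the sync, because a ccb can stay in `pending` indefinitely if the file system never recovers.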


