- **status**: assigned --> unassigned
- **assigned_to**: Anders Bjornerstedt -->  nobody 
- **Milestone**: 4.5.FC --> future



---

** [tickets:#19] IMM: PBE should periodically audit the imm.db file**

**Status:** unassigned
**Milestone:** future
**Created:** Tue May 07, 2013 08:43 AM UTC by Anders Bjornerstedt
**Last Updated:** Fri Nov 22, 2013 10:08 AM UTC
**Owner:** nobody

The Imm Persistent Back-End writes transactions/CCBs incrementally
to an sqlite file "imm.db". This file resides on a replicated file
system. The replicated file system guards against hardware problems
such as failure of the disk or the host where the disk resides.

But there is always a risk of the imm.db file being corrupted
accidentally. This could be due to bugs in the PBE; or due to
network partitioning of the cluster causing two PBEs to
concurrently write to the same file; or accidents with the
backup and restore framework; or problems in the very complex
stack that makes up the shared filesystem (drbd,
journaling, nfs, sqlite recovery).

The problem is that the imm.db file is logically a single
point of failure at cluster start.

If the imm.db is corrupted for whatever reason, then
this may not be discovered until the critical time when it
is needed for a cluster restart.

This enhancement proposes that the PBE shall have some form
of periodic audit of the existing imm.db file.

One possibility is for the PBE to periodically copy the imm.db
file to a local tmp directory. During the copy the PBE will
buffer & delay the regular user requests (CCBs & PRTA updates).
As soon as the copy has been made, a "pseudo loading" will
be invoked using the copy of imm.db. In essence the immloader
is invoked such that it reads the imm.db in exactly the way
it does during loading, but does not try to actually load
anything towards the immsv.
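
The copy-then-read step could be sketched roughly as follows. This is only
an illustration in Python, not PBE or immloader code: the function name
`audit_copy` is hypothetical, and sqlite's own `PRAGMA integrity_check`
plus a full read of every table stands in for the "pseudo loading" pass.
The buffering of CCBs and PRTA updates during the copy is not shown.

```python
import os
import shutil
import sqlite3
import tempfile


def audit_copy(db_path):
    """Copy db_path to a temp dir and read the copy end to end.

    Returns a list of problem strings; an empty list means the copy
    was readable. Hypothetical helper, not part of OpenSAF.
    """
    problems = []
    tmp_dir = tempfile.mkdtemp(prefix="imm_audit_")
    copy_path = os.path.join(tmp_dir, "imm.db.copy")
    try:
        shutil.copyfile(db_path, copy_path)
        conn = sqlite3.connect(copy_path)
        try:
            # Structural check performed by sqlite itself.
            rows = conn.execute("PRAGMA integrity_check").fetchall()
            if rows != [("ok",)]:
                problems.extend(str(row[0]) for row in rows)
            # Read every row of every table, as a loader would have to.
            tables = conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            ).fetchall()
            for (name,) in tables:
                quoted = '"' + name.replace('"', '""') + '"'
                for _ in conn.execute("SELECT * FROM " + quoted):
                    pass
        except sqlite3.DatabaseError as exc:
            problems.append(str(exc))
        finally:
            conn.close()
    finally:
        # The audit copy is throwaway; always clean it up.
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return problems
```

A non-empty return value would correspond to the case where the PBE
restarts and regenerates the imm.db file.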

Note that this level of audit will only catch consistency problems
in the PBE/sqlite representation of the imm data.
Loading may fail on higher levels, by failing checks inside
the immsv or applications (failing validation by OIs).

The point of this is to discover an inconsistency earlier,
when the problem has hopefully not impacted the executing
cluster. If a problem is detected, then the PBE will restart
and generate a new version of the imm.db file.

Migrated from:
http://devel.opensaf.org/ticket/2451
------------------------------------------------------------------


The audit could actually verify snapshot value equality between the sqlite
representation in PBE and the in-memory representation in immsv. By
initializing an iterator towards the immsv during the short stop period for
mutations enforced during the file copy, the iterator will take a snapshot of
the in-memory representation.

That snapshot should reflect all committed CCBs and PRTA updates. The same
values should be committed to the PBE representation.
-------------------------------------------------------------------------
 http://list.opensaf.org/pipermail/devel/2012-February/021139.html
-------------------------------------------------------------------------
The fix for this enhancement should be based on an improvement of
verifyPbeState(..) in imm_pbe_dump.cc.
That function is executed each time the PBE re-attaches to the imm.db file.
Currently it is very weak. It should ideally verify the state of all persistent
objects both ways. All objects that exist in the imm.db must exist in the imm
and have the same state; and all persistent objects that exist in the imm must
exist in the imm.db file and have the same state.
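
The two-way check described above can be illustrated with a small Python
sketch. It is not the verifyPbeState(..) code: the single table
`objects(dn, value)` is an assumed, simplified stand-in for the real imm.db
schema, and the `in_memory` dict stands in for the snapshot an iterator
towards the immsv would provide.

```python
import sqlite3


def compare_both_ways(conn, in_memory):
    """Compare persisted rows with an in-memory snapshot, in both directions.

    conn: sqlite3 connection holding a hypothetical table objects(dn, value).
    in_memory: dict mapping object DN to value (the snapshot).
    Returns (missing_in_memory, missing_in_db, differing), each a sorted list.
    """
    persisted = dict(conn.execute("SELECT dn, value FROM objects"))
    # Direction 1: everything in the file must exist in memory.
    missing_in_memory = sorted(set(persisted) - set(in_memory))
    # Direction 2: everything persistent in memory must exist in the file.
    missing_in_db = sorted(set(in_memory) - set(persisted))
    # Objects present on both sides must also have the same state.
    differing = sorted(
        dn for dn in set(persisted) & set(in_memory)
        if persisted[dn] != in_memory[dn]
    )
    return missing_in_memory, missing_in_db, differing
```

The audit passes only if all three returned lists are empty.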

This same function could be periodically invoked by the immnd-coord using an
admin-op towards the pbe. This should only be done during periods when there is
a lull in persistence traffic. The frequency can be quite low, but could also be
increased in relation to write traffic.
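
One way to picture such a scheduling rule is the Python sketch below. It
is purely an assumption for illustration: the function name `audit_due`,
the parameters, and every numeric default are hypothetical and would need
tuning; nothing here is taken from the immnd or PBE code.

```python
def audit_due(now, last_audit, last_write, writes_since_audit,
              base_interval=3600.0, min_interval=300.0, quiet_period=10.0):
    """Decide whether an audit should run now. All times are in seconds.

    All defaults are illustrative assumptions, not OpenSAF settings.
    """
    # Only audit during a lull: no persistent write for quiet_period seconds.
    if now - last_write < quiet_period:
        return False
    # More writes since the last audit shorten the wait, down to min_interval.
    interval = max(min_interval,
                   base_interval / (1.0 + writes_since_audit / 100.0))
    return now - last_audit >= interval
```

With these defaults an idle system is audited about once an hour, while
heavy write traffic brings the interval down towards five minutes.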

Finally, there is a point in closing and re-opening the imm.db file before
performing the verification. This is to protect against accidental removal of
the file (inode).
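
The reason re-opening helps is that on POSIX systems an already open file
descriptor keeps working after the file has been unlinked, so a writer may
not notice the removal. A minimal Python sketch of an equivalent check
follows; the helper name `file_still_linked` is hypothetical and this is
not how the PBE does it today.

```python
import os


def file_still_linked(fd, path):
    """Return True if the open descriptor fd still refers to the file at path.

    Compares device and inode numbers of the open file with those of
    whatever currently exists at path (POSIX semantics assumed).
    """
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        return False  # the path is gone; fd points at an unlinked inode
    opened = os.fstat(fd)
    return (opened.st_dev, opened.st_ino) == (on_disk.st_dev, on_disk.st_ino)
```

A False result means the descriptor refers to a removed or replaced file,
and the file found by path (if any) is not the one being written.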
-----------------------------------------------------------------


