Hi guys,
as I'm currently working on a recovery tool, I have reviewed the current
configuration we have. Let's first analyse what a recovery tool is good
for and how it should work.
1) Usage
We need to rebuild the indices if the database is corrupted. This can
happen on many occasions, mainly when the server is brutally interrupted
(power shutdown, for instance). As indices are dynamically built while
injecting new entries (or when updating those entries), it is important
to have such a tool to rebuild them whenever the indices are not up to
date, as otherwise we may have orphan entries or, even worse, indices
pointing to non-existing or wrong entries.
Another usage for such a tool would be to create indices offline. This
has the great advantage of allowing a mass injection of entries into the
master table (the table containing all the entries), then doing a global
re-indexing, potentially avoiding a lot of expensive controls, as we may
run a pre-check on the data before doing the injection. (Of course, by
adjusting the number of checks to do, we can go faster or slower,
depending on how much we trust the data we inject into the server.)
2) How it should work
This is where we have a real problem. Assuming that the backend storage
can be totally corrupted, we can't really trust it to rebuild the
indices. What are our options? Let's see how the system works when we
modify some entries:
- the database is in a S0 state at t0 (let's say when we have started
the server the very first time)
- each added entry changes the current state. As we may have deferred
writes, this state is written to disk only every N seconds (unless
deferred writes are disabled)
- at some point, we are in state Si, and we have some more modifications
- then we have a crash. It can occur before the current modifications
are written to disk, in which case we have lost every modification since
state Si (a); or, worse, it can occur while we are writing those
modifications, and now we can have a totally unstable base (b).
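To make the two failure modes concrete, here is a minimal sketch (a toy
Python model, nothing ADS-specific; all names are invented):

```python
# Toy model of deferred writes; all names invented, nothing ADS-specific.
class ToyStore:
    def __init__(self):
        self.disk = {}       # last flushed state (Si)
        self.pending = []    # modifications not yet written to disk

    def modify(self, key, value):
        self.pending.append((key, value))

    def flush(self, crash_after=None):
        # crash_after simulates a power loss after that many writes (case b)
        for i, (k, v) in enumerate(self.pending):
            if crash_after is not None and i >= crash_after:
                raise RuntimeError("crash mid-flush: base is unstable")
            self.disk[k] = v
        self.pending = []

store = ToyStore()
store.modify("cn=a", 1)
store.flush()                      # disk now holds state S1
store.modify("cn=b", 2)
store.modify("cn=c", 3)
# Case (a): crash *before* flush -> disk still holds S1, both mods lost.
# Case (b): crash *during* flush -> "cn=b" written, "cn=c" not: unstable.
try:
    store.flush(crash_after=1)
except RuntimeError:
    pass
print(store.disk)                  # {'cn=a': 1, 'cn=b': 2}
```

The point of the toy is that after the mid-flush crash, the disk matches
neither S1 nor S2, which is exactly case (b).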
Case (a) can be handled, as the base is not corrupted. The problem is
that we have lost some data, which may matter. However, if we don't use
deferred writes, we can avoid such a case (except for the last
modification), with the major inconvenience that we are now more likely
to fall into case (b)...
Case (b) is more problematic, because we have no way to determine which
was the previous stable state (Si), nor to restore that state, as the
base has already been partially modified.
What is the solution? We have to assume that the state Si can be
restored, and that we can apply every modification on it. The only way
to do that is to combine two techniques:
- back up the base on a regular basis, assuming that this can be done
without allowing any update on the base during the backup (and this is
not obvious, as the base can be pretty big)
- and store every modification into a journal, to be able to replay them
on the restored base.
(keep in mind that we are not using a transactional system).
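The combination of the two techniques can be sketched like this (again a
toy Python model with invented names, not a real implementation):

```python
# Sketch of backup + journal recovery; all names invented for illustration.

def recover(backup, journal, backup_position):
    """Restore the last backup (state Si), then replay every journal
    entry recorded after the backup was taken."""
    base = dict(backup)                      # restore Si
    for op, key, value in journal[backup_position:]:
        if op in ("add", "modify"):
            base[key] = value
        elif op == "delete":
            base.pop(key, None)
    return base

backup = {"cn=a": 1}                         # state Si, taken at position 1
journal = [("add", "cn=a", 1),               # already in the backup
           ("add", "cn=b", 2),               # logged after the backup
           ("delete", "cn=a", None)]
print(recover(backup, journal, 1))           # {'cn=b': 2}
```

The only thing the journal has to guarantee is that each modification is
durably written before it is applied to the base, so a replay can always
catch up from the backup position.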
3) How does it translate for ADS ?
ADS has a ChangeLog interceptor which logs all the modified entries as a
list of modifications. Each change has a unique revision (to be fixed,
as we are not using a synchronized counter atm), and can be stored in a
sequential text file, each modification being stored as an LDIF change
operation (we currently don't have a file-based storage).
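As an illustration, a journaled modification would simply be a standard
LDIF change record (the DN and attributes here are made up):

```ldif
dn: cn=John Doe,ou=users,dc=example,dc=com
changetype: modify
replace: mail
mail: john.doe@example.com
-
```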
The idea would be to add some way to save the underlying files, and then
to apply all the stored logs on these files. We could even replay the
whole journal from day one, but this would be total overkill. A third
option, IMHO way better, as it eliminates the need to do a backup, would
be to apply the log on a separate base, so that we don't need to do
backups on the fly (of course, before applying the current logs, we
should back up the spare files, and when the logs have been applied,
ditch the N-1 backup). Here is the algorithm:
Journal last position is N, and we have n modifications since then
Spare current base is version N
- time to apply the journal! Mark the current position in the journal
as N+1
- copy the spare base (Spare-N) to Spare-(N+1)
- apply the journal from position N to position N+1
- if everything is ok, tell the system that the current backup base is
Spare-(N+1) and that the current log position is N+1
- now, we can ditch the Spare-N base
Journal last position is now N+1, and we may have m modifications since
then, as the server continues to log
Spare current base is version N+1
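The rotation above, sketched as a toy Python model (files replaced by
dicts, all names invented):

```python
# Sketch of the spare-base rotation described above; file handling is
# simulated with dicts, and all names are invented for illustration.

def rotate_spare(spare, journal, position):
    """Apply journal entries [position, len(journal)) to a copy of the
    spare base; only on success does the copy become the new spare."""
    new_position = len(journal)              # mark N+1 in the journal
    candidate = dict(spare)                  # copy Spare-N to Spare-(N+1)
    for op, key, value in journal[position:new_position]:
        if op in ("add", "modify"):
            candidate[key] = value
        elif op == "delete":
            candidate.pop(key, None)
    # everything ok: the candidate becomes the current spare, Spare-N can go
    return candidate, new_position

spare, position = {}, 0
journal = [("add", "cn=a", 1), ("add", "cn=b", 2)]
spare, position = rotate_spare(spare, journal, position)
# the server keeps logging while we were applying...
journal.append(("delete", "cn=a", None))
spare, position = rotate_spare(spare, journal, position)
print(spare, position)                       # {'cn=b': 2} 3
```

Note that the old spare is only discarded once the candidate has been
fully built, which is what keeps a crash during the apply phase
recoverable.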
In order to get this working in the server, we have a few things to do:
- decouple the Log sync from the partition sync (currently both are
written at the same time)
- extend the ChangeLog to write in a flat file, injecting LDIF into it
- add a thread to implement the previous algorithm
- add a handler to run the previous algorithm if the server has crashed,
when the server is restarted
- add a command-line option to run the algorithm offline
A second advantage would be to allow a bulk load without all the
time-consuming controls we have in the server.
Thoughts ?
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org