Hi guys,
as I'm currently working on a recovery tool, I have reviewed the current
configuration we have. Let's first analyse what a recovery tool is good
for and how it should work.
1) Usage
We need to rebuild the indices if the database is corrupted. This can
happen on many occasions, mainly when the server is brutally interrupted
(power shutdown, for instance). As indices are dynamically built while
injecting new entries (or when updating those entries), it is important
to have such a tool to rebuild them whenever the indices are not up to
date, as otherwise we may have orphan entries or, even worse, indices
pointing to non-existing or wrong entries.
Another usage for such a tool would be to create indices offline. This
has the great advantage of allowing a mass injection of entries into the
master table (the table containing all the entries), then doing a global
re-indexing, potentially avoiding a lot of expensive controls, as we may
run a pre-check on the data before doing the injection. (Of course, by
adjusting the number of checks to do, we can go faster or slower,
depending on how much we trust the data we inject into the server.)
2) How it should work
This is where we have a real problem. Assuming that the backend storage
can be totally corrupted, we can't really trust it to rebuild the
indices. What are our options? Let's see how the system works when we
modify some entries:
- the database is in a S0 state at t0 (let's say when we have started
the server the very first time)
- each added entry changes the current state. As we may have deferred
writes, this state is written to disk only every N seconds (unless
deferred writes are disabled)
- at some point, we are in state Si, and we have some more modifications
- then we have a crash. It can occur before the current modifications
are written to disk, in which case we have lost every modification since
state Si (a); or, worse, it can occur while we are writing those
modifications, and now we can have a totally unstable base (b).
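To make the two failure modes concrete, here is a minimal sketch (a toy
Python model, nothing ADS-specific; all names are invented):

```python
# Toy model of deferred writes; all names invented, nothing ADS-specific.
class ToyStore:
    def __init__(self):
        self.disk = {}       # last flushed state (Si)
        self.pending = []    # modifications not yet written to disk

    def modify(self, key, value):
        self.pending.append((key, value))

    def flush(self, crash_after=None):
        # crash_after simulates a power loss after that many writes (case b)
        for i, (k, v) in enumerate(self.pending):
            if crash_after is not None and i >= crash_after:
                raise RuntimeError("crash mid-flush: base is unstable")
            self.disk[k] = v
        self.pending = []

store = ToyStore()
store.modify("cn=a", 1)
store.flush()                      # disk now holds state S1
store.modify("cn=b", 2)
store.modify("cn=c", 3)
# Case (a): crash *before* flush -> disk still holds S1, both mods lost.
# Case (b): crash *during* flush -> "cn=b" written, "cn=c" not: unstable.
try:
    store.flush(crash_after=1)
except RuntimeError:
    pass
print(store.disk)                  # {'cn=a': 1, 'cn=b': 2}
```

The point of the toy is that after the mid-flush crash, the disk matches
neither S1 nor S2, which is exactly case (b).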
Case (a) can be handled, as the base is not corrupted. The problem is
that we have lost some data, which may matter. However, if we don't use
deferred writes, we can avoid such a case (except for the last
modification), with the major inconvenience that we are now more likely
to fall into case (b)...
Case (b) is more problematic, because we have no way to determine which
was the previous stable state (Si), nor to restore that state, as the
base has already been partially modified.
What is the solution? We have to assume that the state Si can be
restored, and that we can apply every modification on it. The only way
to do that is to combine two techniques:
- back up the base on a regular basis, assuming that this can be done
without allowing any update on the base during the backup (and this is
not obvious, as the base can be pretty big)
- and store every modification into a journal, to be able to replay them
on the restored base.
(keep in mind that we are not using a transactional system).
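The combination of the two techniques can be sketched like this (again a
toy Python model with invented names, not a real implementation):

```python
# Sketch of backup + journal recovery; all names invented for illustration.

def recover(backup, journal, backup_position):
    """Restore the last backup (state Si), then replay every journal
    entry recorded after the backup was taken."""
    base = dict(backup)                      # restore Si
    for op, key, value in journal[backup_position:]:
        if op in ("add", "modify"):
            base[key] = value
        elif op == "delete":
            base.pop(key, None)
    return base

backup = {"cn=a": 1}                         # state Si, taken at position 1
journal = [("add", "cn=a", 1),               # already in the backup
           ("add", "cn=b", 2),               # logged after the backup
           ("delete", "cn=a", None)]
print(recover(backup, journal, 1))           # {'cn=b': 2}
```

The only thing the journal has to guarantee is that each modification is
durably written before it is applied to the base, so a replay can always
catch up from the backup position.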
3) How does it translate for ADS ?
ADS has a ChangeLog interceptor which logs all the modified entries as a
list of modifications. Each change has a unique revision (to be fixed,
as we are not using a synchronized counter atm), and can be stored in a
sequential text file, each modification being stored as an LDIF change
operation (we currently don't have a file-based storage).
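As an illustration, a journaled modification would simply be a standard
LDIF change record (the DN and attributes here are made up):

```ldif
dn: cn=John Doe,ou=users,dc=example,dc=com
changetype: modify
replace: mail
mail: john.doe@example.com
-
```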
The idea would be to add some way to save the underlying files, and then
to apply all the stored logs on these files. We could even replay the
whole journal from day one, but this would be total overkill. A third
option, IMHO way better, as it eliminates the need to do a backup, would
be to apply the log on a separate base, so that we don't need to do
backups on the fly (of course, before applying the current logs, we
should back up the spare files, and when the logs have been applied,
ditch the N-1 backup). Here is the algorithm:
Journal last position is N, and we have n modifications since then
Spare current base is version N
- time to apply the journal! Mark the current position in the journal
as N+1
- copy the spare base (Spare-N) to Spare-(N+1)
- apply the journal from position N to position N+1
- if everything is ok, tell the system that the current backup base is
Spare-(N+1) and that the current log position is N+1
- now, we can ditch the Spare-N base
Journal last position is now N+1, and we may have m modifications since
then, as the server continues to log
Spare current base is version N+1
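The rotation above, sketched as a toy Python model (files replaced by
dicts, all names invented):

```python
# Sketch of the spare-base rotation described above; file handling is
# simulated with dicts, and all names are invented for illustration.

def rotate_spare(spare, journal, position):
    """Apply journal entries [position, len(journal)) to a copy of the
    spare base; only on success does the copy become the new spare."""
    new_position = len(journal)              # mark N+1 in the journal
    candidate = dict(spare)                  # copy Spare-N to Spare-(N+1)
    for op, key, value in journal[position:new_position]:
        if op in ("add", "modify"):
            candidate[key] = value
        elif op == "delete":
            candidate.pop(key, None)
    # everything ok: the candidate becomes the current spare, Spare-N can go
    return candidate, new_position

spare, position = {}, 0
journal = [("add", "cn=a", 1), ("add", "cn=b", 2)]
spare, position = rotate_spare(spare, journal, position)
# the server keeps logging while we were applying...
journal.append(("delete", "cn=a", None))
spare, position = rotate_spare(spare, journal, position)
print(spare, position)                       # {'cn=b': 2} 3
```

Note that the old spare is only discarded once the candidate has been
fully built, which is what keeps a crash during the apply phase
recoverable.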
In order to get this working in the server, we have a few things to do:
- decouple the Log sync from the partition sync (currently both are
written at the same time)
- extend the ChangeLog to write in a flat file, injecting LDIF into it
- add a thread to implement the previous algorithm
- add a handler to run the previous algorithm if the server has crashed,
when the server is restarted
- add a command-line option to run the algorithm offline
A second advantage would be to allow a bulk load without all the
time-consuming controls we have in the server.
Thoughts ?
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org