Re: [Mavibot] Crash recovery system proposed algorithm

Emmanuel Lécharny Sun, 17 Mar 2013 02:42:20 -0700

What magic a good night of sleep can do :)

So, we have a BTree that somehow can keep old versions. This is a
feature we should leverage to recover from a crash.


The real problem is that flushing to disk can be done with a
getFD().sync(), but this is an extremely costly operation. We simply
can't call this method every time we modify the BTree, otherwise
performance will just be horrible.

One solution would be to flush to disk periodically, keeping all the
revisions created since the last flush: this will allow us to recover
from an old revision if we have a crash. At worst, we will just recover
from the last flushed revision (which means we will lose all the changes
made in between).
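
A minimal sketch of the periodic flush idea, in Java. Everything here is
hypothetical and not Mavibot code: the PeriodicFlusher class, its
flushEvery threshold and the "one byte array per revision" layout are
assumptions made only for illustration. It shows that getFD().sync() is
called once per batch of revisions, and that the last synced revision is
the one we would restart from after a crash.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper (not part of Mavibot): appends revisions to a file
// and only forces them to disk once every 'flushEvery' revisions.
class PeriodicFlusher implements AutoCloseable {
    private final RandomAccessFile file;
    private final int flushEvery;
    // Revisions written since the last sync; they may be lost on a crash.
    private final Deque<Long> unflushed = new ArrayDeque<>();
    // -1 means that nothing has been synced yet.
    private long lastFlushedRevision = -1L;

    PeriodicFlusher(File path, int flushEvery) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        this.flushEvery = flushEvery;
    }

    // Appends the bytes of one new revision. The costly sync only happens
    // when enough revisions are pending.
    void write(long revision, byte[] pages) throws IOException {
        file.seek(file.length());
        file.write(pages);
        unflushed.addLast(revision);
        if (unflushed.size() >= flushEvery) {
            flush();
        }
    }

    // Forces everything written so far to disk and records the newest
    // revision as the recovery point.
    void flush() throws IOException {
        if (unflushed.isEmpty()) {
            return;
        }
        file.getFD().sync();
        lastFlushedRevision = unflushed.peekLast();
        unflushed.clear();
    }

    long lastFlushedRevision() {
        return lastFlushedRevision;
    }

    int pendingRevisions() {
        return unflushed.size();
    }

    @Override
    public void close() throws IOException {
        flush();
        file.close();
    }
}
```

With flushEvery set to 3, writing revisions 1 and 2 leaves both pending;
writing revision 3 triggers the sync, and revision 3 becomes the recovery
point. A timer-based trigger would work the same way, only the condition
that calls flush() changes.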

This is a trade-off: if we don't want to lose anything, we have to flush
to disk after every single write. OTOH, if we accept that we may lose
some of the modifications, we can recover from a crash easily, as the
database will always be in a state from which we can restart.

Last but not least, we can also use a journal to keep all the
modifications on disk (writing modifications to a journal is less
costly, as we don't update data in many places on the disk).
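
As an illustration only, here is a rough sketch of such a journal in
Java. The SimpleJournal class, the length-prefixed record format and the
use of plain strings to describe operations are all assumptions, not the
Mavibot design. Each modification is appended at the end of a single
file, and on recovery the complete records are read back in order; a
record that was only partially written before the crash is ignored.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only journal (not part of Mavibot).
class SimpleJournal {
    // Appends one operation as a length-prefixed record, then forces it
    // to disk. Syncing on every append is an assumption of this sketch;
    // a real implementation could batch several records per sync.
    static void append(File journal, String operation) throws IOException {
        byte[] payload = operation.getBytes(StandardCharsets.UTF_8);
        try (FileOutputStream fos = new FileOutputStream(journal, true);
             DataOutputStream out = new DataOutputStream(fos)) {
            out.writeInt(payload.length);
            out.write(payload);
            out.flush();
            fos.getFD().sync();
        }
    }

    // Reads back every complete record, in the order it was written.
    // Stops silently at the first truncated record.
    static List<String> replay(File journal) throws IOException {
        List<String> operations = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(journal))) {
            while (true) {
                byte[] payload;
                try {
                    int length = in.readInt();
                    if (length < 0) {
                        break; // corrupted length, stop replaying
                    }
                    payload = new byte[length];
                    in.readFully(payload);
                } catch (EOFException e) {
                    break; // end of journal, or a record cut by a crash
                }
                operations.add(new String(payload, StandardCharsets.UTF_8));
            }
        }
        return operations;
    }
}
```

Calling append() twice and then replay() returns the two operations in
the order they were written. The sketch does no checksum and does not
truncate the journal once the operations have been applied to the BTree;
both would be needed in practice.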

In any case, the journal is a mandatory piece, and we can always let the
user disable it if the performance is too low.

Thoughts ?

On 3/17/13 4:00 AM, Emmanuel Lécharny wrote:
> On 3/16/13 7:59 PM, sebb wrote:
>> On 14 March 2013 08:56, Emmanuel Lécharny <elecha...@gmail.com> wrote:
>>> On 3/13/13 4:39 PM, Emmanuel Lécharny wrote:
>>>> One small update, as I have made a mistake in my initial mail :
>>>>
>>>>
>>>> It's not implemented atm, will work on that.
>>>>
>>>> Any better idea ?
>>> I rethought the proposal this morning, and found it overly complex.
>>>
>>> A better idea is to store an offset to the BTree headers in a list of
>>> BTree offsets, at the beginning of the file. If this list of offsets
>>> can't be stored in a single page, we will use a new page to store the
>>> overflowing offsets. Adding or removing a BTree will just be a matter of
>>> adding or removing an offset from this list (which might require a
>>> rewrite of those pages).
>>> Thoughts ?
>> Seems to me that the currently proposed solutions all depend on the
>> disk blocks being updated in a specific sequence.
> True. The alternative is to use a journal that stores the pending
> operations, and which is flushed in a timely fashion and applied when
> we recover from a crash. We can also force the data to be written to
> disk, using file.getFD().sync(), on every modification, but this is
> extremely costly.
>
> We use a journal for the persisted BTree (ie, a BTree in memory backed
> on disk).
>
> Right now, I don't 'force' the data to be written to disk - ie, if the
> system crashes, you may lose something - as I'm focusing on getting the
> data to be stored correctly when everything is fine. This is obviously
> something that needs to be improved. However, the critical part is to
> guarantee that we always point to the correct data when we start the DB.
> That requires that we always flush the new versions before flushing the
> BTree header, which refers to this new version.
>
>
>
>> Depending on the hardware/OS/language being used, AFAIK this may not
>> be possible to enforce.
>>
>> May I suggest that any assumptions about the behaviour of the host
>> disk system should be clearly documented?
> Right now, there is no assumption made. In the near future, once all the
> basic operations work fine, we will have to think about those
> assumptions, and add the code to allow recovery in case of a crash.
>
> Do you have any proposals, ideas, or suggestions ?
>


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com 


