I've done a little looking into journals, fsyncs, and filesystems recently.

One thing I'm trying to understand is whether there was a reason for SQLite
choosing to use a rollback journal (of the steps to undo a transaction) rather
than a replay journal (of the steps to perform a transaction).

In the same way as SQLite uses the presence of a complete (and consistent)
journal to indicate that a rollback should be performed, the presence of a
complete (and consistent) journal can instead indicate that a replay should
be performed.

One difference between rollback and replay journals is that with a rollback
journal the transaction becomes committed on removal of the journal, but with
a replay journal the transaction commits on completion of the journal.

This difference means that to ensure durability of a commit with a rollback
journal, the parent directory must be synced (after the removal).  I can't see
that this is done in SQLite, which would mean that durability against
filesystem interruption is not obtained for the most recent transaction.
(This is not important to me personally, but some may like to know that, on
return from a commit statement, the transaction is committed to disk.)

The reason that I was considering a replay journal is that for exclusive
access mode (at least - it gets more complicated with multiple connections)
several transactions could be "committed" to the journal before any need to
sync the database file itself.

As a possible further optimization, if some "buffer" was kept containing
(perhaps locations of) pages modified by the transactions in the journal, then
transactions in the journal would not need to be written to the database
itself until the journal was so large as to make maintenance of the buffer too
costly.

This could significantly reduce the frequency of filesystem barrier operations
required for consistency, particularly on filesystems with some sort of
safe-append behavior (even if it's just that unwritten blocks are zeroed out
as on xfs and ntfs).

(If durability of each transaction is desired, then writes to the database
itself might as well commence after each transaction is committed and synced
to the journal, but need not be synced in the database itself until the
size of the journal is such that it should be removed.)

Thanks,
Karl.
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to