I've done a little looking into journals, fsyncs, and filesystems recently.
One thing I'm trying to understand is whether there was a reason for SQLite choosing to use a rollback journal (of the steps to undo a transaction) rather than a replay journal (of the steps to perform a transaction). In the same way as SQLite uses the presence of a complete (and consistent) journal to indicate that a rollback should be performed, the presence of a complete (and consistent) journal can instead indicate that a replay should be performed. One difference between rollback and replay journals is that with a rollback journal the transaction becomes committed on removal of the journal, but with a replay journal the transaction commits on completion of the journal. This difference means that to ensure durability of a commit with a rollback journal, the parent directory must be synced (after the removal). I can't see that this is done in SQLite, which would mean that durability against filesystem interruption is not obtained for the most recent transaction. (This is not important to me personally, but some may like to know that, on return from a commit statement, the transaction is committed to disk.) The reason that I was considering a replay journal is that for exclusive access mode (at least - it gets more complicated with multiple connections) several transactions could be "committed" to the journal before any need to sync the database file itself. As a possible further optimization, if some "buffer" was kept containing (perhaps locations of) pages modified by the transactions in the journal, then transactions in the journal would not need to be written to the database itself until the journal was so large as to make maintenance of the buffer too costly. This could significantly reduce the frequency of filesystem barrier operations required for consistency, particularly on filesystems with some sort of safe-append behavior (even if it's just that unwritten blocks are zeroed out as on xfs and ntfs). (If durability of each transaction is desired, then writes to the database itself might as well commence after each transaction is committed and synced to the journal, but need not be synced in the database itself until the size of the journal is such that it should be removed.) Thanks, Karl. _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users