Re: [sqlite] Sqlite 3.5.9 journal_mode vs ext3 journaling

Bob Ebert Thu, 29 May 2008 13:14:19 -0700

I was under the impression that setting PRAGMA synchronous=NORMAL also
removed the fsync() at step 6.  I'm pretty sure when we run with
synchronous=NORMAL we see only one fsync() call per COMMIT operation,
not 2 (and definitely not 3 as we see with synchronous=FULL).

                                                --Bob

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of D. Richard Hipp
Sent: Thursday, May 22, 2008 6:06 PM
To: General Discussion of SQLite Database
Subject: Re: [sqlite] Sqlite 3.5.9 journal_mode vs ext3 journaling

On May 22, 2008, at 6:50 PM, Bob Ebert wrote:

> I wonder if anyone is in a position to compare and contrast the 
> journal_mode and synchronous settings with the various ext3 journal 
> modes?
>
> Up until now we've been using ext3 with data=ordered, and sqlite3 with

> synchronous=normal, journal_mode=delete.  We're on an embedded system 
> with a very high frequency of random power cycling.  In this mode, 
> we've seen an unusually large number of corrupted databases.
>
> My best theory right now is that we lose power after the ext3 metadata

> has written the delete of the -journal file, but before all of the 
> page overwrites for the db file are fully flushed, since these are 
> done by two different processes in normal linux.

SQLite calls fsync() before it calls unlink() on the journal file.  So
if fsync() is working as documented, all database data should be safely
on oxide prior to the unlink().  We have seen instances before where
fsync() returned long before the data was on oxide, so I would not be
surprised by this.  You can often see this yourself by mounting a flash
memory stick, writing a file on the stick, calling fsync() then watching
the LED on the end of the stick continue to flash long after the fsync()
has returned.

Please recognize that there really is nothing that SQLite can do to
correct this problem.  SQLite depends on the operating system and/or
disk controller living up to its end of the contract.  If the disk says
all data is on oxide, SQLite has no choice but to believe it because
SQLite has no way to independently verify the matter.  And if the disk
controller and/or operating system is lying, and a power failure
follows, the database can be corrupted.  There isn't much you can do
when your hardware starts lying to you.

>  I believe if we lose power at
> this point, then after a restart ext3 will replay the journal delete, 
> but will leave the main DB in a semi-written state, and thus corrupt 
> the DB.
>
> I'm wondering what impact switching to journal_mode=persist will have 
> on this scenario.  I believe this will change how the last step of the

> atomic commit (clearing the journal) is written to disk, and thus 
> changes how ext3 will recover the file after a power loss.  Currently 
> our IO scheduler doesn't guarantee ordered writes, so in theory the 
> journal header clearing could still make it to disk before all the 
> main db pages.  We're in a position to adjust the IO scheduler if 
> necessary to prevent this.

Once again, SQLite calls fsync() on the database prior to calling
write() to clear the journal header.  So *if* fsync is doing its job,
you should have no problems.  *If*.

Note the the whole point of these fsync() calls in SQLite is to act as
an I/O barrier operation - to guarantee that all I/O operations that are
issued prior to the fsync() complete prior to any I/O operations issued
afterwards.  We must have an I/O barrier in order to preserve database
integrity across a power failure.  If your fsync() is not work quite
right, then all bets are off.  I don't know of anything SQLite can do to
make the situation better.

>
>
> Are there any other potential holes or races between ext3 journal data

> and sqlite file contents that I should worry about?  Is 
> synchronous=full the only way to guarantee atomic commits under these 
> conditions?
>                       

A synchronous=FULL commit goes like this:

       1.  Write all pages that will change into the journal
       2.  fsync() the journal
       3.  Overwrite the journal header to indicate that it is valid
       4.  fsync() the journal
       5.  Write changes into the database file.
       6.  fsync() the database files
       7.  Delete or truncate or overwrite the journal header (depending
on journaling mode)

The difference between FULL and NORMAL is that NORMAL omits the
fsync() on step 2.  That is the only difference.  The step-2 fsync is
important on some filesystems, but on ext3 it is probably unnecessary.
So I don't think that going to synchronous=FULL is going to help you.

D. Richard Hipp
[EMAIL PROTECTED]

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Sqlite 3.5.9 journal_mode vs ext3 journaling

Reply via email to