Re: [sqlite] presentation about ordering and atomicity of filesystems

Howard Chu Sat, 13 Sep 2014 13:26:02 -0700

Scott Robison wrote:

On Fri, Sep 12, 2014 at 6:21 PM, Richard Hipp <d...@sqlite.org> wrote:

On Fri, Sep 12, 2014 at 8:07 PM, Simon Slavin <slav...@bigfraud.org>
wrote:


   one thing that annoys me about SQLite is that it needs to make a
journal file which isn't part of the database file.  Why ?  Why can't it
just write the journal to the database file it already has open ?  This
would reduce the problems where the OS prevents an application from
creating a new file because of permissions or sandboxing.

Where in the database does the journal information get stored?  At the
end?  What happens then if the transaction is an INSERT and the size of the
content has to grow?  Does that leave a big hole in the middle of the file
when the journal is removed?  During recovery after a crash, where does the
recovery process go to look for the journal information?   If the journal
is at some arbitrary point in the file, where does it look?  Note that we
cannot write the journal location in the file header because the header
cannot be (safely) changed without first journaling it but we cannot
journal the header without first writing the journal location into the
header.

Journaling filesystems already have this problem. By default they just use a section of the partition, reserved at FS creation time. Which leads to the problem already described in the video that started this thread - perform a large enough write operation and you can exceed the fixed size of the journal, which requires the journal data to be split, and the operation's journal update is no longer atomic.
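To illustrate the splitting problem, here is a minimal sketch that assumes a journal with a fixed capacity measured in blocks. The function names and the block-count model are hypothetical and do not correspond to any particular filesystem's implementation.

```python
from typing import List


def split_into_journal_batches(tx_blocks: int, journal_capacity: int) -> List[int]:
    """Split a write of tx_blocks blocks into batches that each fit the journal.

    Hypothetical model: the journal holds at most journal_capacity blocks,
    so a larger write has to be committed in several separate batches.
    """
    if tx_blocks < 0 or journal_capacity <= 0:
        raise ValueError("tx_blocks must be >= 0 and journal_capacity > 0")
    batches = []
    remaining = tx_blocks
    while remaining > 0:
        take = min(remaining, journal_capacity)
        batches.append(take)
        remaining -= take
    return batches


def is_atomic(tx_blocks: int, journal_capacity: int) -> bool:
    """A write is journaled atomically only if it fits in a single batch."""
    return len(split_into_journal_batches(tx_blocks, journal_capacity)) <= 1
```

In this model a 10-block write against a 4-block journal becomes three separate commits, and a crash between any two of them leaves the operation partially applied.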

Of course, most journaling filesystems also allow you to optionally specify an external journal - i.e., instead of embedding the journal on the filesystem's partition, you can use some other block device instead. Naturally you can also choose a larger size when doing this. Putting the journal on a separate device can bring some major performance benefits, as well as accommodating larger transactions.

In the tests I did two years ago, JFS with an external journal was blazingly fast. http://symas.com/mdb/microbench/july/#sec11

One idea that might work is to interleave the journal information with the
content.  So for each page in the database, there is a corresponding page
of journal content.  The downside there is that you double the size of the
database file without increasing its storage capacity.
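The interleaving idea can be sketched as simple offset arithmetic. This assumes one possible layout in which each data page is immediately followed by its journal page; the 4096-byte page size and the function names are assumptions for illustration, not anything SQLite does.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def data_page_offset(page_no: int) -> int:
    """Byte offset of data page page_no (0-based) in the interleaved file."""
    return 2 * page_no * PAGE_SIZE


def journal_page_offset(page_no: int) -> int:
    """Byte offset of the journal page paired with data page page_no."""
    return data_page_offset(page_no) + PAGE_SIZE


def file_size(n_pages: int) -> int:
    """Total file size for n_pages data pages: twice the usable capacity."""
    return 2 * n_pages * PAGE_SIZE
```

Recovery always knows where to look, since the journal slot for any page is at a fixed offset from it, but a database of 100 usable pages occupies 200 pages on disk.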

This is why LMDB is much better suited to this task - it uses no journal at all, nor does it require compaction/defragmentation/VACUUMing.

A couple of academic thoughts.

1. If one wanted to embed the journal within the database, would it be
adequate to reserve a specific page as the "root" page of the journal, then
allocate the remaining pages as normal (either to the journal or the main
database)? This does leave the big hole problem so it may still not be
ideal, but it would give you a known location to find the beginning of the
journal without doubling the database size or requiring an extra file.


Starting with a known location is definitely a step in the right direction.

2. Building on 1, could sparse files be used to accomplish this? Seek to
"really big constant offset" and do all journaling operations at that
point, allowing the operating system to manage actual disk allocation? If

We're talking about implementing a filesystem. "The operating system" is your own code in this case; you don't get to foist the work off onto anyone else.

this were possible, deleting the journal would be a "fast" truncate
operation. A custom VFS might be able to provide a proof of concept... hmm.
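For an ordinary file on a host filesystem that supports sparse files, the proposal in point 2 might look roughly like the sketch below. The offset constant and helper names are hypothetical, and the sketch ignores the case where the database grows past the journal offset. It does not address the objection above for the case where there is no underlying OS to lean on.

```python
import os
import tempfile

PAGE_SIZE = 4096
JOURNAL_OFFSET = 1 << 26  # hypothetical "really big constant offset" (64 MiB)


def write_journal(f, payload: bytes) -> None:
    """Write journal data at the fixed offset; the gap is left as a hole
    if the host filesystem supports sparse files."""
    f.seek(JOURNAL_OFFSET)
    f.write(payload)
    f.flush()
    os.fsync(f.fileno())


def drop_journal(f, db_size: int) -> None:
    """Delete the journal by truncating the file back to the database size."""
    f.truncate(db_size)
    f.flush()
    os.fsync(f.fileno())


def demo():
    """Return (db_size, size_with_journal, size_after_truncate)."""
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * (2 * PAGE_SIZE))  # two stand-in database pages
        db_size = f.tell()
        write_journal(f, b"J" * PAGE_SIZE)
        size_with_journal = os.fstat(f.fileno()).st_size
        drop_journal(f, db_size)
        size_after = os.fstat(f.fileno()).st_size
        return db_size, size_with_journal, size_after
```

The logical file size jumps to the journal offset plus the journal length while the journal exists, and falls back to the database size after the truncate. How much disk is actually allocated for the hole depends entirely on the host filesystem.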


--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
