Jure Pečar wrote:
> Hi all,
> 
> Now that solid state disks are getting affordable (Gigabyte iRam, for
> example), it makes sense to use them as external journal with full data
> journaling, so they cache all the small writes and dump them to disks
> in one single sequential write on every journal flush.

Where can I find the paper on why this makes sense?  Because offhand,
it doesn't, unless you're hoping that the majority of transactions can
be replayed from the journal at boot, rather than rolled back.

> I know how to configure that under ext3. Simply set up external journal
> and mount filesystem with data=journal and commit=600 or some such
> value. But I'm not so sure about reiserfs.

v3, I don't know; it's probably closer to the way ext3 works.  Closer,
though not identical: ext3's on-disk format is just ext2 plus a journal
file, so of the three it's going to be the easiest to point at another
device.
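
For what it's worth, the ext3 setup you describe would go something
like this -- device names are made up, with /dev/sdb1 standing in for
the i-RAM and /dev/sda2 for the real disk, and I haven't benchmarked
whether a long commit interval actually batches writes the way you're
hoping:

  # the journal device must use the same block size as the filesystem
  mke2fs -b 4096 -O journal_dev /dev/sdb1

  # create the filesystem, pointing it at the external journal
  mke2fs -b 4096 -j -J device=/dev/sdb1 /dev/sda2

  # mount with full data journaling and a 10-minute commit interval
  mount -o data=journal,commit=600 /dev/sda2 /mnt/data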

I'm going to assume you aren't talking about v4, since this sounds like
a mission-critical production-style environment.  As I understand it, v4
has a completely different way of doing journaling.



I'm replying to you, not because I actually have an answer for you, but
because your case seems interesting, and I'm curious how Reiser4 handles it.

Currently, my v4 is built like this:  I have 2 gigs of RAM and about a
350 gig Reiser4 partition.  I have a custom patch that replaces the
sys_fsync system call with a stub, because fsync performance is worse in
Reiser4 than in anything else, and because fsync gets horrendously
abused by so many programs I use.  Flushing on my system is a privilege,
not a right, and abusing the fsync call means I've revoked that privilege.

So, basically, an application says "Oh my GOD and all that is holy, this
piece of data MUST be written to disk now!"  The OS ignores it, and puts
it in the write buffer with everything else.  At this point, nothing
will touch the disk until I need RAM (even for more disk cache), or I
run a "sync" call myself (not an fsync call), or I unmount the FS (shut
the box down).
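
If you want most of that behavior without patching your kernel, the
stock 2.6 writeback sysctls get you part of the way there -- they
postpone background writeback, though they do nothing about fsync
itself, which is what the stub is for.  The numbers here are just what
I'd try, not gospel:

  # let dirty pages sit for an hour before pdflush considers them old
  echo 360000 > /proc/sys/vm/dirty_expire_centisecs
  echo 360000 > /proc/sys/vm/dirty_writeback_centisecs

  # don't start background writeback until most of memory is dirty
  echo 60 > /proc/sys/vm/dirty_background_ratio
  echo 80 > /proc/sys/vm/dirty_ratio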

This makes performance much better, but it sucks for me when my
not-entirely-stable box (overclocked, running proprietary nVidia
graphics drivers) decides to crash.  Now, I have to say, Reiser4 has
proven quite
durable, and I haven't actually had significant corruption.  I have,
however, lost the small amounts of data that never made it to disk.

Now, fsync is basically a crude kind of transaction, and in the future,
Reiser4 will be doing all kinds of transactions.  Take a mail server,
for instance.  The mail server wants to complete the transaction
of adding a new mail to someone's inbox, or at least to some local
spool, before it tells the other server that it successfully received
the message.  Now, obviously, we don't want every message immediately
forced onto the disk, because that would kill performance.  But if we
don't do that, if we let those transactions stay in RAM, then when the
mail server loses power, it WILL lose messages.

Lose, as in permanently.  Now, imagine it's something even more
important, like a bank computer.  You can't just "roll back" someone's
mortgage payment, can you?

So, in order for performance not to suck, and to keep fragmentation
down, we want a bunch of transactions to be flushed to disk at once.
But in order to not lose data, we want every important transaction to
immediately be guaranteed not to be lost.

So, you buy some battery-backed RAM or some such, flush your
transactions there first, and make sure they are safely written to
that "journal" device before you OK the purchase or accept the email
or whatever.  Then, when that device starts to fill up, you flush it
out to the real hard disk.

Problem is, I see nowhere for this to fit in the current model of
Reiser4.  As I understand it, there is no concept of a separate
"journal" device, or of writing a file twice, because the vast majority
of writes are simply written out to disk in the new location, and then
the "commit" is updating the metadata to point to the new location and
free the old.

But, at least some code must already be there, right?  Because I know
that, at least in theory, some writes happen twice -- things like
updates to a
database file.  Tiny changes to huge files would be written once to a
new location, and then back to the old, so the file stays in the same
place on disk, and doesn't get fragmented.

Could that logic be adapted to write first to some journal device, and
then to the original location on disk?  And to use the same
memory-pressure strategy that currently drives the decision to flush
from RAM to disk?  (That is, be lazy about moving stuff from the journal
device to the real medium...)

Could it be done flexibly?  For instance, have a number of "journal
devices", from your RAM all the way to your real disk, and be able to
specify which ones are faster (be lazy about moving from a faster device
to a slower one) and which ones are stable (at what point can we be sure
the data won't be lost to a power failure)?  Because obviously, there
will be degrees of persistent storage, just as there are degrees of
volatile storage, from swap to RAM to cache to register.

I would argue that, while an attempt could be made to do this
transparently, it shouldn't be.  For instance, Laptop Mode is a set of
patches that tries not to touch the hard drive at all, but whenever it
has to spin the disk up anyway, flushes everything it can.  It's
basically lazy writes to save battery.
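
Concretely, stock kernels already expose a knob for this -- it went
into mainline around 2.6.6, if I remember right:

  # nonzero enables laptop mode: once an unavoidable read has spun the
  # disk up, flush everything dirty while the platters are moving anyway
  echo 1 > /proc/sys/vm/laptop_mode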

I would argue that the filesystem can and should know about lazy writes,
even if you still need Laptop Mode to tell it to flush on reads.  And
the filesystem can and should know about fast, nonvolatile storage.

But why that's a good idea, I'm not sure right now, because it's
bedtime.  Ask me tomorrow, or write your own rant.



Well, end of rant.  Someone else gets to have fun coding this, because I
have to do some Real Work.  As in School.
