-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

At some point hitherto, Benjamin Scott hath spake thusly:
>   A Berkeley/mbox format mail store is basically a flat text file.  

Being dissatisfied with the performance of UW IMAP, I've been thinking
a lot about this problem.  I've got some ideas about how to make some
(potentially drastic, I think) performance improvements.

> Messages are appended one after another.  There is no index.  To
> read message number #368, you have to read and parse messages 1
> through 367, too.  And you have to do that every time you open the
> mail store.

This doesn't necessarily need to be true, strictly speaking.  You
certainly do need to read the whole file to parse all the messages, at
least once, but you don't necessarily need to do it every time you
open the mail store.  For instance, the IMAP server can index the file
and cache the results.  It then only needs to update the cache when it
detects the mail store has changed.  It can retain this cache on disk
(perhaps only writing the cache out to disk when it closes a store),
with information that will allow it to determine whether the file has
changed since the cache was created, to improve performance.  On the
other hand, I'm not really sure what sort of performance impact
there'd be if the particluar mail store was very high volume...

> This gets really ugly if you have a large number of messages, or
> very large messages (think file attachments).  To compound the
> problem, many IMAP clients disconnect and reconnect very frequently.

Under the above scenario, the penalty for this can be minimized...

> When it comes to deletes, the process of moving the mailbox to /tmp
> is stupid, but the rest of it is unavoidable.  If you delete a
> message from a an mbox mail store, you have to rewrite the whole
> file, just like you do for any other text file.

This also is not strictly true.  You can rewrite the file in place,
and then you need only re-write messages that come after deleted
messages.  This has some interesting implications.

As most old messages in a mailbox tend to be messages that the mail
user wants to keep around, most messages which get deleted in a given
session tend to be towards the end of the mail store.  This means that
deleting messages from the store can have much less of a penalty than
they often do in some implementations.

There are two ways (that I can think of) to do this re-writing.  The
technique for both is basically the same, though the mechanism is
slightly different.  Also, note that this only needs to be done when
the client performs an EXPUNGE.  Depending on what protocol you're
using, you might be able to get away with simply marking a deleted
message as such in memory, without modifying the physical mail store
at all, until the mailbox is expunged or closed, depending on the
protocol.

The technique is to create a buffer which is the size of the message
being deleted.  Then move messages which are not being deleted down in
the file in chunks of the size of the buffer.  As you encounter more
deleted messages, you can increase the size of the buffer, so that
more of the mail store is re-written at once.  Obviously you need to
make adjustments for when the next chunk is less than the size of your
buffer, or when you're crossing the boundary of a message that is
being deleted...

The two mechanisms to do this would be MMIO, or use the appropriate
combinations of fseek, fread, and fwrite.  The first method would
likely be faster on machines with sufficient memory, but might suffer
from swapping problems on machines with little memory.  The second
method is less prone to paging.  Deciding which to use would likely
involve experimentation, or providing both and using run-time
information to determine which is more appropriate to the situation.

Now, whether or not these methods improve performance over what UW
IMAP does, I can't say.  Though I suspect that they can't help but do
so.... ;-)  Also, the second method basically guarantees the server
would use both less disk space and less memory.  I'm not so sure about
MMIO (seems like it would be a memory hog, but I don't really know
that much about MMIO, unfortunately).  I keep meaning to write a
library to implement this, in order to compare, but I never seem to
get around to it.  Ultimately, I would want to use these in an
implementation of an IMAP server with which I might replace UW IMAP,
but that means I'd need to implement the IMAP protocol, and after
looking through some of the RFC's, that really just doesn't look like
fun.  ;-)

>   Basically, mbox sucks.  :-)

Sure, but the alternatives do too.  They just suck differently.  They
do have significant benefits over mbox.  They generally don't suffer
from locking issues, as there's basically no need to lock at all with
message-per-file mailbox formats.  They also make writes (e.g.
deletes) much faster.

The trick is initially opening the mail store.  Large Maildir
mailboxes typically take an order of magnitude longer to open than
mbox files, on account of the fact that the program must go through a
series of cycles:

  open -> read -> close -> (filesystem seek) -> open -> read -> close

This takes MUCH longer than mbox's

  open -> read -> read -> read -> read -> close

Since filesystems on Unix systems tend to minimize fragmentation by
design, the likelihood is that you will not need to seek much when
reading a large mbox file.  You also do not need to do repeated open
calls, which have substantial overhead.  

Opening a mailstore on mbox, for these reasons, is almost invariably
*much* faster.  And personally, that's the part I like waiting for the
least.  Filesystem design can certainly impact HOW MUCH mbox wins...
But (using the mutt e-mail client as a bench mark) various members of
the mutt-users mailing list tested local (identical) mailboxes using
both mbox and maildir, on ext[23], reiserfs, ufs, and FreeBSD's super
duper enhanced FS with patches to improve performance for filesystems
with lots of small files (which did the best with maildir out of all
of them), but for this operation, mbox won IN ALL CASES.


- -- 
Derek Martin               [EMAIL PROTECTED]    
- ---------------------------------------------
I prefer mail encrypted with PGP/GPG!
GnuPG Key ID: 0x81CFE75D
Retrieve my public key at http://pgp.mit.edu
Learn more about it at http://www.gnupg.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE8rng5djdlQoHP510RAn4uAKCMh31ZIkSAgxGX0JNQBfmYf3VNEQCfczIi
X9fal1PdArP7CwKMJIRgvkE=
=zO53
-----END PGP SIGNATURE-----

*****************************************************************
To unsubscribe from this list, send mail to [EMAIL PROTECTED]
with the text 'unsubscribe gnhlug' in the message body.
*****************************************************************

Reply via email to