On Fri, Apr 05, 2002 at 03:17:01PM +0200, Jesus Cea Avion wrote:
> Problem:
> 
> - People leave mail in the mailbox. Scanning the mailbox every time is
> an I/O-hungry operation.
> 
> - Rewriting a partially updated mailbox is very expensive. UIDL update,
> partial mailbox deleting, mail arrives while the popper is running...
 
  Correct, both points.  These are major performance problems right
now.

> Solution:
> 
> A simple and efficient database (key/value) used to store messages. For
> example, BerkeleyDB (http://www.sleepycat.com/)
 
  However, this loses compatibility with the many existing mail-related
programs which rely on the well-known UNIX mbox format.

> Qpopper would have six operations:
> 
> - Translate standard mailboxes into the database.
> 
> - Serve mails from database.
> 
> - An additional tool to show statistics about users: messages in
> database, length, last login, quota...
> 
> - An additional tool to list and delete a specific user message.
> 
> - An additional tool to delete a user and all its messages.
> 
> - An additional tool to kill all popper processes, disable POP3 logins
> and reconstruct the database if it's necessary. This operation,
> typically, lasts 4-5 seconds.
> 
> We could have another tool to delete messages already read and
> older than a month, for example.
 
  And you never hereafter receive mail?  You need to at least have some
interface for MTAs to deliver mail into the database other than by
someone popping their mail!  And nobody will want shell access to mail,
and nobody will want IMAP, and nobody will want to use procmail or
sieve or maildrop or ...?

  I think you're really no longer talking about redesigning Qpopper
when you add this scope, you are talking about implementing most of a
complete new mail system, and you need to make it coexist with at least
the most common dozen or so other packages that form other parts of the
mail system.

  The idea of a databased mail system is potentially a good one, and is
being kicked around by a lot of people in a lot of forms.  For one
implementation in progress, which I heard about on the Postfix list,
see <http://www.dbmail.org/>.  Note that it still has some bugs.

  However, IMHO reading Brad Knowles' LISA 2000 paper on large mail
systems should be a prerequisite for proposing a solution like this:
  <http://www.shub-internet.org/brad/papers/dihses/lisa2000/>  The
bottlenecks aren't necessarily where one might assume.  (For instance,
he argues that it is not a given that maildir improves performance, as
it requires more writes of "synchronous meta-data" to the filesystem.
That's an important factor!)
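
  To make the meta-data point concrete, here is a minimal sketch of the
two delivery styles.  It is an illustration only, not taken from the
paper or from any real delivery agent; the function names, the file
naming scheme and the omission of locking are all assumptions.

```python
import itertools
import os
import socket
import time

_counter = itertools.count()


def deliver_maildir(maildir, message_bytes):
    """Deliver one message maildir-style: write into tmp/, then rename into new/."""
    unique = "%d.P%dQ%d.%s" % (time.time(), os.getpid(), next(_counter),
                               socket.gethostname())
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)
    # Metadata: a new inode plus a directory entry in tmp/.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(message_bytes)
        f.flush()
        os.fsync(f.fileno())
    # Metadata again: entry removed from tmp/, entry added to new/.
    os.rename(tmp_path, new_path)
    return new_path


def deliver_mbox(mbox_path, mbox_entry):
    """Append one entry (already starting with its "From " line) to an mbox."""
    # Usually no directory update: the file already exists and just grows.
    # Locking is omitted here; a real delivery agent must lock the mailbox.
    with open(mbox_path, "ab") as f:
        f.write(mbox_entry)
        f.flush()
        os.fsync(f.fileno())
```

  Every maildir delivery creates and renames a file, so it touches
directory entries each time; the mbox append mostly writes data to a
file that already exists.  How much that costs depends on the
filesystem.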

> Example:
> 
> You could have a central mailbox database. Every email in the database
> would have a unique UID. Every message resides in two records, for
> example. One record contains the message body. The other record has
> the message headers, which can be modified by qpopper (UIDL, Status,
> etc).
...
> When a user logs in via POP3, qpopper would translate new messages in the
> user's standard mailbox into the database (erasing the original mailbox). Then,
> the messages are served from the database. The message migration can be
> implemented, also, with a cron job to migrate mailboxes with infrequent
> logins.

  If you're keeping initial messages delivered from [your MTA] in mbox
format, this probably means you're doing this mbox scan on many
sessions, which means you're doing a large part of the I/O currently
needed.

> Advantages:
> 
> - You don't need to scan anything when you have the messages in the
> database. You know, at any time, how many messages a user has, their length,
> and so on. If new email arrives, you migrate it to the database.
 
  This is the key advantage, but really this part boils down to having
a better message info cache system, which can be implemented without
completely reimplementing Qpopper into a database.
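
  A minimal sketch of such a cache, to show the idea rather than a
design: per-message offsets and lengths are kept in a small side file
and trusted only while the mailbox's size and mtime still match.  The
names `scan_mbox` and `load_index`, the JSON side file, and the
assumption that a larger file means mail was only appended are all
hypothetical; none of this is existing Qpopper code.

```python
import json
import os


def scan_mbox(path, start=0):
    """Return [offset, length] pairs for messages found from byte `start` on.

    Simplification: any line beginning with "From " starts a message.
    """
    messages = []
    msg_start = None
    offset = start
    with open(path, "rb") as f:
        f.seek(start)
        for line in f:
            if line.startswith(b"From "):
                if msg_start is not None:
                    messages.append([msg_start, offset - msg_start])
                msg_start = offset
            offset += len(line)
    if msg_start is not None:
        messages.append([msg_start, offset - msg_start])
    return messages


def load_index(mbox_path, cache_path):
    """Return the message index, rescanning only what the cache cannot vouch for."""
    st = os.stat(mbox_path)
    try:
        with open(cache_path) as f:
            cache = json.load(f)
    except (OSError, ValueError):
        cache = {"size": 0, "mtime_ns": None, "messages": []}

    if cache["size"] == st.st_size and cache["mtime_ns"] == st.st_mtime_ns:
        messages = cache["messages"]            # unchanged: no scan at all
    elif 0 < cache["size"] < st.st_size:
        # Assumption: the file only grew by appended deliveries.
        messages = cache["messages"] + scan_mbox(mbox_path, cache["size"])
    else:
        messages = scan_mbox(mbox_path)         # shrunk or unknown: full scan

    with open(cache_path, "w") as f:
        json.dump({"size": st.st_size, "mtime_ns": st.st_mtime_ns,
                   "messages": messages}, f)
    return messages
```

  With something like this, a session on an unchanged mailbox reads
only the small index, and a session after new deliveries scans only
the appended tail instead of all the old saved mail.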

> - You can delete individual messages without needing to rewrite the
> mailbox.

  You still have to rewrite the database... but you do avoid repeated
handling of the old saved messages.  (Maildir also wins on this.)
 
> - You can modify headers without expensive I/O, since headers (typically
> <2Kbytes) are kept separated from message bodies.
 
  Qpopper shouldn't be modifying the headers, other than to add a UID
(which can be avoided!)

> - New messages arriving while qpopper is working don't require mailbox
> rewriting.
 
  They require loading into the database at the next POP session,
though.

> - Berkeley DB, for example, can retrieve partial records. That is,
> you can have a 15 MB message, and you don't need to read it in one shot.
> In fact, you can read the message in 64 Kbyte chunks, for example, to
> keep memory and I/O small.
 
  That doesn't actually reduce I/O, just splits it up.  Qpopper doesn't
read the whole 15MB in one chunk either.
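
  A small sketch of why chunking bounds memory but not I/O, assuming
the message's offset and length within a flat file are already known.
The helper name `iter_chunks` is made up for illustration and is not
how Qpopper is actually written.

```python
def iter_chunks(path, offset, length, chunk_size=64 * 1024):
    """Yield one message's bytes in pieces of at most `chunk_size` bytes."""
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:          # file ended earlier than the index claimed
                break
            remaining -= len(chunk)
            yield chunk
```

  Whatever `chunk_size` is, the chunk lengths add up to the message
length; only the peak buffer size changes.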

> - Berkeley DB overhead in disk space and CPU is fairly small.
...

  I think many of these are valid points, but some don't apply, and
some have simpler solutions.

  To make my concerns clear: I don't totally reject the idea of using a
database for mail.  You are also correct that it is critical for
performance to eliminate the cases where Qpopper now needs to
completely rewrite a mailbox.  However, I think to properly implement
this database proposal, it will need to go far beyond the scope of
Qpopper, and much of the gains from the Qpopper-specific portions of it
could be gotten in simpler ways.  

  Perhaps another way of putting it is: if all these changes were made
to Qpopper as you describe, would it still be Qpopper at the end and
usable as it is now, or would it be a totally different beast?

  For Qpopper to be able to work as it does now, for systems using just
mbox format, but also be able to work as you describe, then its present
mailbox I/O would need to be abstracted to a separate mailbox interface
I/O layer, somewhat along the lines of the UW-imapd "c-client" code. 
(I don't personally like the UW code style, but there are clean ways to
implement the same goal.)

  This would be a good first step for Qpopper because it creates a
clean abstraction which would enable a number of enhancements, starting
with integrating maildir in a coherent way, but also including a direct
interface to databased mailsystems like you're describing.
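
  As a rough sketch of what such a layer might look like (the class
and method names below are hypothetical, and this is neither Qpopper
nor c-client code): the POP3 side is written once against a small
interface, and each storage format supplies its own backend.

```python
import os
from abc import ABC, abstractmethod


class MailStore(ABC):
    """Hypothetical backend interface; the POP3 code would only see this."""

    @abstractmethod
    def list_ids(self):
        """Return message identifiers in a stable order."""

    @abstractmethod
    def fetch(self, msg_id):
        """Return the raw bytes of one message."""

    @abstractmethod
    def delete(self, msg_id):
        """Remove one message."""


class MboxStore(MailStore):
    """Single-file mbox: ids are positions, so they shift after a delete."""

    def __init__(self, path):
        self.path = path

    def _messages(self):
        messages, current = [], None
        with open(self.path, "rb") as f:
            for line in f:
                if line.startswith(b"From "):   # simplified separator test
                    current = []
                    messages.append(current)
                if current is not None:
                    current.append(line)
        return [b"".join(m) for m in messages]

    def list_ids(self):
        return list(range(len(self._messages())))

    def fetch(self, msg_id):
        return self._messages()[msg_id]

    def delete(self, msg_id):
        # mbox has no cheap delete: the whole file is rewritten.
        kept = [m for i, m in enumerate(self._messages()) if i != msg_id]
        with open(self.path, "wb") as f:
            f.writelines(kept)


class MaildirStore(MailStore):
    """One file per message under cur/ and new/: ids are relative paths."""

    def __init__(self, root):
        self.root = root

    def list_ids(self):
        ids = []
        for sub in ("cur", "new"):
            for name in sorted(os.listdir(os.path.join(self.root, sub))):
                ids.append(os.path.join(sub, name))
        return ids

    def fetch(self, msg_id):
        with open(os.path.join(self.root, msg_id), "rb") as f:
            return f.read()

    def delete(self, msg_id):
        os.unlink(os.path.join(self.root, msg_id))   # no rewrite needed


def pop_stat(store):
    """Message count and total octets, as POP3 STAT reports them."""
    sizes = [len(store.fetch(i)) for i in store.list_ids()]
    return len(sizes), sum(sizes)
```

  A database backend would be a third subclass, and `pop_stat` would
not change.  The sketch rescans and fetches far more than a real
server should; it only shows where the format-specific code would
live.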

  All IMHO.  As your sig says, "Things are not so easy."
  -- Clifton

-- 
    Clifton Royston  --  LavaNet Systems Architect --  [EMAIL PROTECTED]
"What do we need to make our world come alive?  
   What does it take to make us sing?
 While we're waiting for the next one to arrive..." - Sisters of Mercy
