On Wed, 17 Jan 2007, Per Foreby wrote:
> This topic has come up before. I ended up deferring action because analysis
> suggests that the problem is more cosmetic than anything else. My analysis
> suggested that the cost of a "cleanup" tool would be greater than its
> benefit.
Wouldn't a cleanup shell- or perl-script be easy to implement? Just loop
through all .mix########, and unlink all files not referenced in the
index file. Or am I missing something important?
You misunderstand the problem. The mix driver itself cleans up all
.mix######## files that are not referenced in the index file. There is no
need at all to have an external process to do this since it's already
done.
The problem is that, particularly with the incoming flood of spam, it
becomes more likely for new data files to be created in the normal course
of things, and that after the spam is expunged, you're left with many
smaller data files with few messages in a mailbox, as opposed to fewer
larger data files with many messages.
The cleanup procedure in this case involves consolidating these smaller
files, and updating the index to point to the consolidated file. The
problem is that this cleanup procedure has risks; there are known failure
modes (and possibly some currently-unknown ones) which have to be
protected against. It also defeats the underlying intent of mix, which is
to reduce backups.
I believe that, for almost all cases, the risks outweigh the benefits.
It's somewhat like disk defragmentation; there are people who believe that
they have to defragment their hard drive every day, without understanding
that they are actually causing themselves more problems.
I believe that proper tuning of the MIXDATAROLL parameter (an art in which
I myself am not yet fully skilled) addresses this problem for almost all
cases, to the point that the rare pathological mailbox can be consolidated
with a manual procedure. As in, something that may get done once every 6
months or so to one mailbox.
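
As a rough illustration of why the MIXDATAROLL setting matters here, the
following toy model counts how many data files survive a sparse expunge.
It is only a sketch: the function name, the rollover rule (start a new
data file once the current one has reached the roll size), and the byte
counts are assumptions made for illustration and are not taken from the
mix driver.

```python
def surviving_data_files(sizes, roll, expunged):
    """Count data files that still hold a message after an expunge.

    Assumed toy model, not the real driver: messages are appended to the
    current data file, and a new data file is started once the current
    one has reached `roll` bytes.
    """
    files = []              # each entry lists the message numbers in one file
    current, used = [], 0
    for msgno, size in enumerate(sizes):
        if current and used >= roll:
            files.append(current)
            current, used = [], 0
        current.append(msgno)
        used += size
    if current:
        files.append(current)
    # A data file can only go away once none of its messages is referenced.
    return sum(1 for f in files if any(m not in expunged for m in f))


if __name__ == "__main__":
    sizes = [100] * 10                   # ten hypothetical 100-byte messages
    sparse = {0, 1, 3, 4, 6, 7}          # expunge some, but not all, per file
    print(surviving_data_files(sizes, 300, sparse))    # small roll size
    print(surviving_data_files(sizes, 1000, sparse))   # large roll size
```

In this model a sparse expunge never empties a data file, so a smaller
roll size leaves more small files behind than a larger one does.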
> UNIX filesystems have been subjected to mh, maildir, netnews, Cyrus, etc.
> mailstores for many many years; and mix will always do better.
But none of the directory based formats will do well if the number of
files becomes too large. On reiser, xfs, ext3 with dir_index, or any
other "smart" filesystem this is not a problem, but most people use
filesystems which require sequential reads to find a file.
This is all true; which is why the problem is considered at all.
However, most people are going to use mix format with IMAP servers, and
not as a local mailbox format (let's face it, traditional UNIX mailbox
format is the overwhelming winner for local mailboxes). IMAP servers are
not run by "most people"; they are a central facility and are built for
higher performance.
Put another way, I doubt very much that a large enterprise is going to
serve 500,000 IMAP users on a 68040 based BSD 4.3 with FFS.
This is rarely a problem now since the mix format is new, but wait a few
years, and we will see directories with huge numbers of files.
The worst case for the number of files in a mix directory is nmsgs+4.
However, mix is almost always going to do much better than that.
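
To see how far a given mailbox is from that bound, one could count its
data files and compare against nmsgs+4. The sketch below is an assumption
-laden illustration: it treats each "#" in ".mix########" as one
hexadecimal digit, takes the "+4" as a plain constant, and uses
hypothetical function names.

```python
import re

# Assumption: each "#" in ".mix########" stands for one hexadecimal digit.
DATA_FILE = re.compile(r"\.mix[0-9a-fA-F]{8}")


def worst_case_files(nmsgs):
    """Upper bound on files in a mix directory, as stated above."""
    return nmsgs + 4


def data_file_count(names):
    """Count the names that look like mix data files."""
    return sum(1 for name in names if DATA_FILE.fullmatch(name))


if __name__ == "__main__":
    # Hypothetical directory listing; in practice this could come from
    # os.listdir() on the mailbox directory.
    names = [".mixindex", ".mix0000abcd", ".mix45ac1f22", "notes.txt"]
    print(data_file_count(names), "data files; bound for 1000 messages:",
          worst_case_files(1000))
```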
The problem will only happen if the mailbox has a huge number of messages
AND is actively (but sparsely) expunged. This will almost never happen to
archival mailboxes, nor will it happen to INBOXes belonging to users who
never delete anything. It won't happen to spam buckets either, since they
are generally emptied in their entirety or possibly expired by date.
Put another way: this problem, to the extent that it is real, happens to
users who keep their INBOXes clean by regularly expunging junk, but also
accumulate thousands of unexpunged messages in their INBOX.
Hence my conclusion that any form of automated cure is worse than the
disease. Except in pathological cases, the benefits are dubious; the
costs are high (backups have to be done of all the consolidated data); and
the risks of a foulup are substantial.
I may change my mind if there is empirical data to indicate otherwise.
However, such data is currently lacking.
The worst case that I see today is a sysadmin who keeps his mailbox
clean but still has several hundred messages. He tends to
accumulate a somewhat smaller, but still 3-digit, file count. However, a
3-digit file count is not worth fixing.
> The (still-unwritten) convert-to-mix tool will write multiple data files
> instead of the single huge data file written by mailutil.
Is it really unwritten? The scripts on http://andrew.triumf.ca/mbx/
(which were announced on this list a few months ago) seem to do the
job, and are easy to modify if the original format isn't mix.
If I'm not mistaken, those scripts are based upon mailutil. mailutil can
be used to convert; the problem is that it writes a single giant data file
instead of splitting it into MIXDATAROLL chunks. This is because mailutil
does an atomic copy, per IMAP, and atomicity overrides MIXDATAROLL.
Normally that is a good thing for copy, but not in conversion.
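
The difference between the two behaviors can be sketched as follows. This
is a hypothetical illustration only: the function names and the rollover
rule (start a new chunk once the current one has reached the roll size)
are assumptions, and neither function reflects the actual mailutil or
c-client code.

```python
def atomic_copy_layout(sizes):
    """One data file holding every message, as an atomic copy produces."""
    return [list(range(len(sizes)))] if sizes else []


def chunked_layout(sizes, roll):
    """Split messages into data files of roughly `roll` bytes each.

    Assumed rule: a new chunk starts once the current one has reached
    `roll` bytes, so a chunk may overshoot by part of one message.
    """
    chunks, current, used = [], [], 0
    for msgno, size in enumerate(sizes):
        if current and used >= roll:
            chunks.append(current)
            current, used = [], 0
        current.append(msgno)
        used += size
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sizes = [400] * 5                      # five hypothetical messages
    print(atomic_copy_layout(sizes))       # a single large data file
    print(chunked_layout(sizes, 1000))     # several roll-sized data files
```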
-- Mark --
http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
_______________________________________________
Imap-uw mailing list
https://mailman1.u.washington.edu/mailman/listinfo/imap-uw