On Wed, 17 Jan 2007, Per Foreby wrote:
> This topic has come up before. I ended up deferring action because analysis
> suggests that the problem is more cosmetic than anything else. My analysis
> suggested that the cost of a "cleanup" tool would be greater than its
> benefit.

Wouldn't a cleanup shell- or perl-script be easy to implement? Just loop through all .mix########, and unlink all files not referenced in the index file. Or am I missing something important?

You misunderstand the problem. The mix driver itself cleans up all .mix######## files that are not referenced in the index file. There is no need at all to have an external process to do this since it's already done.

The problem is that, particularly with the incoming flood of spam, it becomes more likely for new data files to be created in the normal course of things, and that after the spam is expunged, you're left with many smaller data files with few messages in a mailbox, as opposed to fewer larger data files with many messages.

The cleanup procedure in this case involves consolidating these smaller files, and updating the index to point to the consolidated file. The problem is that this cleanup procedure has risks; there are known failure modes (and possibly some currently-unknown ones) which have to be protected against. It also defeats the underlying intent of mix which is to reduce backups.

I believe that, for almost all cases, the risks outweigh the benefits. It's somewhat like disk defragmentation; there are people who believe that they have to defragment their hard drive every day, without understanding that they are actually causing themselves more problems.

I believe that proper tuning of the MIXDATAROLL parameter (an art in which I myself am not yet fully skilled) addresses this problem for almost all cases, to the point that the rare pathological mailbox can be consolidated with a manual procedure. As in, something that may get done once every 6 months or so to one mailbox.

> UNIX filesystems have been subjected to mh, maildir, netnews, Cyrus, etc.
> mailstores for many many years; and mix will always do better.

But none of the directory based formats will do well if the number of files in a directory becomes too large. On reiser, xfs, ext3 with dir_index or any other "smart" filesystem this is not a problem, but most people use filesystems which require sequential reads to find a file.

This is all true; which is why the problem is considered at all.

However, most people are going to use mix format with IMAP servers, and not as a local mailbox format (let's face it, traditional UNIX mailbox format is the overwhelming winner for local mailboxes). IMAP servers are not run by "most people"; they are a central facility and are built for higher performance.

Put another way, I doubt very much that a large enterprise is going to serve 500,000 IMAP users on a 68040 based BSD 4.3 with FFS.

This is rarely a problem now since the mix format is new, but wait a few years, and we will see directories with huge numbers of files.

The worst case for number of files in a mix directory is nmsgs+4. However mix is almost always going to do much better than that.
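To see where a given mailbox sits relative to that bound, something like the following Python sketch could be used. It only counts directory entries and does not read or modify the mailbox. The assumption that the ######## in a data file name is eight hexadecimal digits is mine, and the function names are hypothetical.

```python
import os
import re

# Assumption: data files are named ".mix" followed by exactly eight
# hexadecimal digits (the ".mix########" pattern mentioned in this thread).
DATA_FILE_RE = re.compile(r"^\.mix[0-9a-fA-F]{8}$")


def count_data_files(mailbox_dir):
    """Count directory entries whose names look like mix data files."""
    return sum(1 for name in os.listdir(mailbox_dir)
               if DATA_FILE_RE.match(name))


def worst_case_files(nmsgs):
    """Worst-case number of files in a mix directory: nmsgs + 4."""
    return nmsgs + 4


def messages_per_data_file(mailbox_dir, nmsgs):
    """Average messages per data file, or None if there are no data files."""
    nfiles = count_data_files(mailbox_dir)
    return nmsgs / nfiles if nfiles else None
```

A mailbox whose messages-per-data-file figure is close to 1 is near the worst case; one with a figure in the hundreds is nowhere near it.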

The problem will only happen if the mailbox has a huge number of messages AND is actively (but sparsely) expunged. This will almost never happen to archival mailboxes, nor will it happen to INBOXes belonging to users who never delete anything. It won't happen to spam buckets either, since they are generally emptied in their entirety or possibly expired by date.

Put another way: this problem, to the extent that it is real, happens to users who keep their INBOXes clean by regularly expunging junk, but also accumulate thousands of unexpunged messages in their INBOX.

Hence my conclusion that any form of automated cure is worse than the disease. Except in pathological cases, the benefits are dubious; the costs are high (backups have to be done of all the consolidated data); and the risks of a foulup are substantial.

I may change my mind if there is empirical data to indicate otherwise. However, such data is currently lacking.

The worst cases seen today are for a sysadmin who keeps his mailbox clean but still has several hundred messages. He tends to accumulate a somewhat smaller, but still 3-digit, file count. However, a 3-digit file count is not worth fixing.

> The (still-unwritten) convert-to-mix tool will write multiple data files
> instead of the single huge data file written by mailutil.

Is it really unwritten? The scripts on http://andrew.triumf.ca/mbx/ (which were announced on this list a few months ago) seem to do the job, and are easy to modify if the original format isn't mix.

If I'm not mistaken, those scripts are based upon mailutil. mailutil can be used to convert; the problem is that it writes a single giant data file instead of splitting it into MIXDATAROLL chunks. This is because mailutil does an atomic copy, per IMAP, and atomicity overrides MIXDATAROLL. Normally that is a good thing for copy, but not in conversion.
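As a rough illustration of what "splitting into MIXDATAROLL chunks" means for a conversion, here is a minimal Python sketch. It is not the mix driver's code. The rollover rule (start a new data file when appending the next message would push the current one past the limit) is an assumption, and split_into_chunks and roll_limit are hypothetical names, with roll_limit standing in for MIXDATAROLL.

```python
def split_into_chunks(message_sizes, roll_limit):
    """Group message sizes (in bytes) into per-data-file chunks.

    Assumed rule: a new chunk starts when adding the next message would
    exceed roll_limit. A message larger than roll_limit still gets a
    chunk of its own, since a message is never split across files here.
    """
    chunks = []
    current = []
    current_size = 0
    for size in message_sizes:
        if current and current_size + size > roll_limit:
            chunks.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

An atomic copy, by contrast, corresponds to putting every message into one chunk regardless of the limit, which is the single giant data file described above.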

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
_______________________________________________
Imap-uw mailing list
[email protected]
https://mailman1.u.washington.edu/mailman/listinfo/imap-uw
