Brian Candler wrote:
On Mon, Feb 20, 2006 at 09:21:04AM +0100, Christian Ferrari wrote:
I'm writing a shell script: it identifies "duplicated" mails, strips
"offending header lines" like "Delivered-To:" and hard links the files,
paying attention to different filesystems/devices. To avoid a complete
scan on every run, a persistent "memory" is saved in a status file.
After some refinements I'll release it.
Is anyone willing to try it on a (test!) battlefield?
Not me. If you're going to have an MTA independently fork and deliver copies
of the same message to A, B, C and D, then there are massive race conditions
involved:
- you require the MTA to complete delivery to A before starting delivery to
B
- after delivering to A, you require your delivery script to finish updating
its cache database before the MTA starts delivery to B (otherwise you'll
miss the fact that these messages are the same)
- if the MTA can make two or more simultaneous deliveries, which every MTA
I know does, you'll need to lock your cache database to prevent simultaneous
updates [or have an append-only cache which needs to be purged periodically]
- if the MTAs are distributed across multiple front-ends, the cache database
will need to be stored on some central server [can't use flock() over NFS]
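
To make the locking point concrete, here is a minimal Python sketch of an append-only cache guarded by flock(). The record layout (one "digest<TAB>path" line per mail) and the function names are assumptions made for illustration; they are not taken from the script under discussion, and the approach only applies to a cache on a local filesystem.

```python
import fcntl
import os

def append_record(cache_path, digest, mail_path):
    """Append one 'digest<TAB>path' line while holding an exclusive lock."""
    # O_APPEND makes every write land at the current end of the file.
    fd = os.open(cache_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other writers are done
        os.write(fd, f"{digest}\t{mail_path}\n".encode())
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

def load_records(cache_path):
    """Read the cache under a shared lock; a later line wins for a digest."""
    records = {}
    with open(cache_path, "r") as f:
        fcntl.flock(f, fcntl.LOCK_SH)
        try:
            for line in f:
                digest, _, mail_path = line.rstrip("\n").partition("\t")
                records[digest] = mail_path
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return records
```

Because records are only ever appended, the file grows without bound, which is why such a cache would need the periodic purge mentioned above.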
You are right, but from a space-saving perspective it's not necessary
to hard link _all_ mails: many mails are deleted by users within the first
24-48 hours, so my script will only handle "old" mails.
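
As an illustration of restricting the batch job to "old" mails, a minimal Python sketch follows. The function name, the 48-hour default and the use of the file's mtime as the age of the mail are assumptions; the actual script may select mails differently.

```python
import os
import time

def old_messages(maildir, min_age_hours=48, now=None):
    """Yield paths in maildir/cur whose mtime is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - min_age_hours * 3600
    with os.scandir(os.path.join(maildir, "cur")) as entries:
        for entry in entries:
            # Skip subdirectories and symlinks; only regular files are mails.
            if entry.is_file(follow_symlinks=False) and entry.stat().st_mtime < cutoff:
                yield entry.path
```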
I think you need more than a simple shell script to address these issues.
A batch job has the advantage of leaving all the other components in
place, without patching any software that is already running well.
It would be far more reliable IMO to have the MTA deliver one copy of the
message itself, when it receives a message with multiple recipients in the
envelope. OTOH, your proposed mechanism has the advantage that it could
identify multiple copies of the same message which are delivered in
different SMTP transactions.
There might be other ways to implement this. For example:
- take an MD5 of the message body (having stripped out Delivered-To: and
Received: and anything else which might be unique to one copy)
- deliver the message to /somepath/x/y/md5hash (where x and y are parts of
the md5 hash). If this file already exists, leave it alone.
- hard link from the maildir to this location
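
The first two steps above could be sketched in Python roughly as follows. The list of per-copy headers, the function names and the assumption that messages use LF line endings (as Maildir files normally do) are illustrative choices, not something specified in this thread.

```python
import hashlib
import os

# Headers assumed to differ between copies; this list is illustrative.
VOLATILE = (b"delivered-to:", b"received:")

def normalized_digest(raw):
    """MD5 of a message with the per-copy header lines removed."""
    head, sep, body = raw.partition(b"\n\n")
    kept = []
    skipping = False
    for line in head.split(b"\n"):
        if line[:1] in (b" ", b"\t"):
            # Continuation line: it belongs to the previous header.
            if not skipping:
                kept.append(line)
            continue
        skipping = line.lower().startswith(VOLATILE)
        if not skipping:
            kept.append(line)
    return hashlib.md5(b"\n".join(kept) + sep + body).hexdigest()

def store_path(base, digest):
    """Map a digest to base/x/y/digest, with x and y taken from the digest."""
    return os.path.join(base, digest[0], digest[1], digest)
```

Here x and y are simply the first two hex digits of the digest, which spreads the files over 256 directories; any other split of the hash would serve the same purpose.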
You've guessed that MD5 is the cornerstone of my shell script, but your
idea has a small bug: there is a small probability that two completely
different files have the same MD5 hash. I use the standard Maildir tree and
store MD5 hashes in a dedicated table, but MD5 is only a fast way to find
candidate identical mails; a "diff" must be performed before touching the
mail files.
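
The "hash first, diff afterwards" check could look like this minimal Python sketch. It compares whole files, so it assumes the offending header lines have already been stripped from both; the function names are hypothetical.

```python
import filecmp
import hashlib

def file_md5(path, chunk=1 << 16):
    """MD5 of a file, read in chunks so large mails are not loaded at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def same_content(path_a, path_b):
    """Cheap hash check first, then a full byte-by-byte comparison."""
    if file_md5(path_a) != file_md5(path_b):
        return False
    # shallow=False makes filecmp compare contents, not just stat() data.
    return filecmp.cmp(path_a, path_b, shallow=False)
```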
Effectively that's using the filesystem itself as your cache. With care,
i.e. the right choice of atomic operations, it can be made lock-free. You'll
need a separate /somepath base for each filesystem that contains Maildirs,
of course.
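
One possible choice of atomic operations, sketched in Python under stated assumptions: link() is used to claim the slot in the store and rename() to swap the mail file for a link to the stored copy. The function name and the temporary file naming are hypothetical, and a real job would probably compare the contents first and create the temporary name in the Maildir's tmp/ directory rather than next to the mail.

```python
import os

def dedupe_into_store(mail_path, stored_path):
    """Make mail_path a hard link to stored_path without taking any lock."""
    os.makedirs(os.path.dirname(stored_path), exist_ok=True)
    try:
        # link() is atomic: exactly one concurrent caller creates the entry.
        # Across filesystems it fails with OSError, hence one base per device.
        os.link(mail_path, stored_path)
        return "stored"
    except FileExistsError:
        pass  # someone else stored this content first; leave it alone
    if os.path.samefile(mail_path, stored_path):
        return "already linked"
    # Link the stored copy under a temporary name, then rename it over the
    # mail file; rename replaces the target atomically.
    tmp = f"{mail_path}.dedupe.{os.getpid()}"
    os.link(stored_path, tmp)
    os.replace(tmp, mail_path)
    return "linked"
```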
Regards,
Brian.
Regards,
Christian F.
_______________________________________________
Courier-imap mailing list
[email protected]
Unsubscribe: https://lists.sourceforge.net/lists/listinfo/courier-imap