On Sun, 28 Mar 2010 11:28:29 +0100 "Russell L. Harris" <[email protected]> wrote:
> I have on the order of 10 Gb of mail files.
>
> Most of the files are in maildir format; a few are in mbox format.
>
> The system is Debian GNU/Linux.
>
> I would like to eliminate duplicate messages. There appear to be, on
> the average, perhaps four or five copies of each message.
>
> I also would like to sort the messages on the To: and From: fields,
> saving only certain matches.
>
> I have been searching with Google for "maildir delete duplicate
> perl", but I have not yet found a script which looks promising.
>
> Is there a good standard approach, script, or application for this
> problem?

I would probably take a multi-step approach.

I would look for a module on CPAN that reads the maildir format (for
example, Email::Folder::Maildir, which I found from search.cpan.org). I
would use that to match the To and From fields and remove any that I
didn't want.

The best way to find duplicates is probably through the use of a message
digest and a hash. Walk the messages, passing each through Digest::SHA1
or Digest::MD5 and use the result as the key to a hash. If it already
exists in the hash, delete the message. If not, add it to the hash.

Admittedly, that's just an outline of an approach, but it should get you
started.

G. Wade
-- 
The purpose of software engineering is to control complexity, not to
create it. -- Dr. Pamela Zave
_______________________________________________
Houston mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/houston
Website: http://houston.pm.org/
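
As a minimal sketch of the digest-and-hash step described above, something along these lines might work. It uses only modules that ship with Perl (Digest::MD5 and File::Find) and treats each maildir message as a plain file. The names file_digest and find_duplicates are made up for illustration. It assumes the duplicate copies are byte-for-byte identical; copies that differ only in headers such as Received: would not be caught, and mbox files are not handled at all.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;
use File::Find;

# Return the hex MD5 digest of a file's raw bytes.
sub file_digest {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Walk the given directories and return the paths of files whose
# digest has already been seen. The first copy (in sorted path
# order) is kept; later copies are reported as duplicates.
sub find_duplicates {
    my (@dirs) = @_;
    my (%seen, @files, @dupes);

    # With no_chdir set, $_ holds the full path of each entry.
    find({ no_chdir => 1, wanted => sub { push @files, $_ if -f $_ } }, @dirs);

    for my $path (sort @files) {
        my $digest = file_digest($path);
        if (exists $seen{$digest}) {
            push @dupes, $path;          # content already seen
        }
        else {
            $seen{$digest} = $path;      # first time: remember it
        }
    }
    return @dupes;
}

# Dry run: print the duplicates rather than deleting anything.
if (@ARGV) {
    print "$_\n" for find_duplicates(@ARGV);
}
```

Run it with one or more maildir paths as arguments and check the list it prints before deleting anything. Once the list looks right, the print could be swapped for unlink. Printing first is a deliberate choice here, since a wrong guess about what counts as a duplicate costs mail.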
