Jeff Breidenbach wrote:
> So I just looked at 2 million raw messages from 2007, spread over
> a few thousand mailing lists (all data is from mail-archive.com). My
> first question was - when comparing only with messages from the
> same list - how many times do I see a repeated message-id? The
> answer was ... drumroll please ... 260 thousand. What the hell?

I think the question you originally meant to ask got sidetracked. Assume that all of these "multiple paths from list to archive" duplicates share not only a Message-ID but also a Date (they were the same message originally, so they should). Then both schemes (Message-ID alone, and Message-ID plus Date) would decide that all but one copy of each message is redundant.

What we really want to know is: how many non-empty Message-ID collisions *don't* share a Date? That is the number of messages that the Message-ID-only scheme loses and that the composite identifier scheme would not lose.

-Dale
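
To make the count concrete, here is a rough sketch of how it could be computed for a single list. It assumes each message has already been reduced to a (message_id, date) pair of raw header strings. The function name and the result keys are made up for illustration and are not part of Mailman or mail-archive.com.

```python
from collections import defaultdict


def count_collisions(messages):
    """Compare the two de-duplication schemes on one list's messages.

    `messages` is an iterable of (message_id, date) pairs (hypothetical
    input format; both values are the raw header strings).
    """
    dates_by_id = defaultdict(list)
    for message_id, date in messages:
        if message_id:  # ignore empty or missing Message-IDs
            dates_by_id[message_id].append(date)

    dropped_by_id_only = 0    # copies discarded when keying on Message-ID
    dropped_by_composite = 0  # copies discarded when keying on (Message-ID, Date)
    for dates in dates_by_id.values():
        dropped_by_id_only += len(dates) - 1
        dropped_by_composite += len(dates) - len(set(dates))

    return {
        "dropped_by_id_only": dropped_by_id_only,
        "dropped_by_composite": dropped_by_composite,
        # Messages the Message-ID-only scheme loses but the composite keeps.
        "kept_only_by_composite": dropped_by_id_only - dropped_by_composite,
    }
```

The last value is the number in question: collisions on a non-empty Message-ID where the Date differs. Dates are compared as raw strings here, so two copies whose Date header was rewritten in transit would count as different messages.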