Re: [Mailman-Developers] Improving the archives

2007-11-03 Thread Jeff Breidenbach
but if you can trust yourself to generate them, consecutive integers provide minimal, order-preserving, perfect hashing, too! Hmm this sounds pretty sensible to me. Jeff

Re: [Mailman-Developers] Improving the archives

2007-11-03 Thread Stephen J. Turnbull
Craig Loomis writes: Globally unique IDs, hashed IDs, etc., are very appealing from various CS-y and techie points of view, but are simply not memorable to humans or knowable by dumb external programs. I think as much, or more, effort should be put into delivering a

Re: [Mailman-Developers] Improving the archives

2007-10-30 Thread Craig Loomis
Or Re: [Mailman-Developers 10417] Improving the archives I would like to interject and highlight some use cases for stable and predictable IDs. For us, message IDs are directly used both by people and ignorant programs. Our mailing lists serve as a permanent and concise record of our

Re: [Mailman-Developers] Improving the archives

2007-10-03 Thread Jeff Breidenbach
Question: what about crossposted messages? Let's say a message gets sent to a list called mailman-developers with a CC to a list called pet-bunnies. Hypothetically, of course. Presumably, the person who got the message from pet-bunnies should probably end up at the pet-bunnies archive, where the

Re: [Mailman-Developers] Improving the archives

2007-10-03 Thread Ian Eiloart
--On 2 October 2007 22:47:35 -0400 Barry Warsaw [EMAIL PROTECTED] wrote: One question: should the angle brackets on the Message-ID be part of the hash or not? I think they should, or IOW, the entire value of the Message-ID header is taken as the hash, though they should be stripped off if
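The angle-bracket question matters because the two choices give different hashes for the same message. A minimal Python sketch of the difference follows. It assumes SHA-1 as the digest and base32 as the encoding; the function name `message_id_hash` and the `strip_brackets` flag are hypothetical and not taken from the thread or from Mailman.

```python
import base64
import hashlib


def message_id_hash(message_id: str, strip_brackets: bool = False) -> str:
    """Return a base32-encoded SHA-1 digest of a Message-ID header value.

    Assumption: SHA-1 + base32 is only one plausible choice; the thread
    excerpt does not fix the algorithm.
    """
    value = message_id.strip()
    # Optionally drop the surrounding angle brackets before hashing.
    if strip_brackets and value.startswith("<") and value.endswith(">"):
        value = value[1:-1]
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    # A 20-byte digest encodes to exactly 32 base32 characters, no padding.
    return base64.b32encode(digest).decode("ascii")


with_brackets = message_id_hash("<12345@example.org>")
without_brackets = message_id_hash("<12345@example.org>", strip_brackets=True)
```

Whichever convention is chosen, every party that computes the hash has to apply the same one, or the resulting identifiers will not match.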

Re: [Mailman-Developers] Improving the archives

2007-10-02 Thread Barry Warsaw
On Aug 8, 2007, at 1:04 AM, Dale Newfield wrote: Jeff Breidenbach wrote: 5.85 million messages That's 0.03% if you count all the messages. It is 0.008% if you discard the top three offenders, all of which I have contacted. I'd say that's a

Re: [Mailman-Developers] Improving the archives

2007-08-07 Thread Jeff Breidenbach
What we really want to know is how many (non-empty) Message-ID collisions are there that *don't* share a Date? This is the number of messages that only-messageid loses, and that the composite identifier method would not lose. I took a look at a larger dataset, 5.85 million messages from
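The measurement asked for here can be sketched in a few lines of Python. This is not the script used on the mail-archive.com data; the function name `count_collisions` and the input shape (pairs of Message-ID and Date strings from a single list) are assumptions made for illustration.

```python
from collections import defaultdict


def count_collisions(messages):
    """Count Message-ID collisions in an iterable of (message_id, date) pairs.

    Returns (repeated, differing_date):
      repeated       - messages beyond the first that share a non-empty Message-ID
      differing_date - of those, how many carry a Date not yet seen for that ID,
                       i.e. messages a Message-ID-only key would lose but a
                       Message-ID + Date composite key would keep
    """
    dates_by_id = defaultdict(list)
    for message_id, date in messages:
        if message_id:  # empty Message-IDs are excluded, as in the question
            dates_by_id[message_id].append(date)

    repeated = 0
    differing_date = 0
    for dates in dates_by_id.values():
        if len(dates) > 1:
            repeated += len(dates) - 1
            differing_date += len(set(dates)) - 1
    return repeated, differing_date


sample = [
    ("<a@example.org>", "Tue, 07 Aug 2007 10:00:00 +0000"),
    ("<a@example.org>", "Tue, 07 Aug 2007 10:00:00 +0000"),  # same ID, same Date
    ("<a@example.org>", "Tue, 07 Aug 2007 11:30:00 +0000"),  # same ID, new Date
    ("<b@example.org>", "Tue, 07 Aug 2007 10:00:00 +0000"),
    ("", "Tue, 07 Aug 2007 10:00:00 +0000"),                 # no Message-ID
]
result = count_collisions(sample)
```

Comparing raw Date strings is a simplification; the same instant written in two time zones would count as two different dates here.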

Re: [Mailman-Developers] Improving the archives

2007-08-07 Thread Dale Newfield
Jeff Breidenbach wrote: 5.85 million messages That's 0.03% if you count all the messages. It is 0.008% if you discard the top three offenders, all of which I have contacted. I'd say that's a strong argument for just using the Message-ID and simplifying this tremendously... ...Barry, do you

Re: [Mailman-Developers] Improving the archives

2007-08-01 Thread Jeff Breidenbach
704 messages fall into this category. Of these, 596 come from a single (malfunctioning and duplicate spewing) list server. I have not yet examined the remaining 208 messages, but I'll bet anything many also have duplicate message bodies. Or are spam. So for this data set, we have an upper

Re: [Mailman-Developers] Improving the archives

2007-08-01 Thread Jeff Breidenbach
What we really want to know is how many (non-empty) Message-ID collisions are there that *don't* share a Date? This is the number of messages that only-messageid loses, and that the composite identifier method would not lose. It took longer than expected, but I now have numbers from looking

Re: [Mailman-Developers] Improving the archives

2007-07-26 Thread Dale Newfield
Jeff Breidenbach wrote: So I just looked at 2 million raw messages from 2007, spread over a few thousand mailing lists (all data is from mail-archive.com). My first question was - when comparing only with messages from the same list - how many times do I see a repeated message-id? The answer

Re: [Mailman-Developers] Improving the archives

2007-07-26 Thread Jeff Breidenbach
If you improve the script or find numbers that lead to different conclusions, now's the time to know! Live and learn! So I just looked at 2 million raw messages from 2007, spread over a few thousand mailing lists (all data is from mail-archive.com). My first question was - when comparing only

Re: [Mailman-Developers] Improving the archives

2007-07-26 Thread Jeff Breidenbach
If you are relying on the sender to do the right thing, then why not force them to create proper message-ids? I think Barry's proposal is essentially a numbers game - e.g. he's hoping for significantly better results using Date in the calculation than not using it.

Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Barry Warsaw
On Jul 24, 2007, at 1:11 PM, Terri Oda wrote: On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote: So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL. It is the archive server's job

Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Barry Warsaw
On Jul 24, 2007, at 2:03 PM, Jeff Breidenbach wrote: Regardless of whether we *need* to generate our own unique ID, I'm leaning towards the thought that we're going to *want* to generate our own for usability reasons. In a perfect world, I think

Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Barry Warsaw
On Jul 24, 2007, at 11:04 PM, Stephen J. Turnbull wrote: So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL. It is the archive server's job to decide what is the canonical URL

Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Barry Warsaw
On Jul 25, 2007, at 12:47 AM, Jeff Breidenbach wrote: What you gain from my proposal over a pure Message-ID approach is guaranteed uniqueness given the list copy Guarantee is a pretty strong word. A malicious person could post two messages with

Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Jason Fesler
Guarantee is a pretty strong word. A malicious person could post two messages with the same message-id, same date, but different bodies. This is my concern too. Especially since this is known information; it is trivial to be malicious. Whatever was done, I think would *have* to deal with

Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Gustav H Meyer
Hi, I think this is the first time that I'm posting here but hopefully not the last. Thanks to everyone involved for an incredible project. I'm not much of a developer but I like practical solutions and will do everything possible to help improve in this area even if it's just to give some

Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Stephen J. Turnbull
Barry Warsaw writes: I agree, I just don't think message-ids are user friendly enough to be this canonical url. Especially in this context, which is exactly where urls are thrown in users faces. An archiving service is exactly the right place for redirecting human readable urls to

Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Stephen J. Turnbull
Barry Warsaw writes: Yes, definitely. What do you think of the base32 examples I have on the wiki page? They're somewhat better than Message-IDs for readability, but they're not user-friendly. On Jul 24, 2007, at 1:11 PM, Terri Oda wrote: It seems silly to generate nice short

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Jeff Breidenbach
Notice that of 325146 total messages, 624 of them had no message-id header. Even if you aggregate dup+col, you're still looking at a total duplicate rate of 0.29%. Message ID's are supposed to be unique. This is discussed in RFC 822: 4.6.1 and RFC 1036: 2.1.5, and probably other places. If

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Stephen J. Turnbull
Jeff Breidenbach writes: Notice that of 325146 total messages, 624 of them had no message-id header. Even if you aggregate dup+col, you're still looking at a total duplicate rate of 0.29%. Message ID's are supposed to be unique. Fortunately, a rule more honored in the observance

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread John A. Martin
Stephen J Turnbull wrote on Tue, 24 Jul 2007 15:56:35 +0900: Jeff Breidenbach writes: Notice that of 325146 total messages, 624 of them had no message-id header. Even if you aggregate dup+col, you're still looking

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Stephen J. Turnbull
John A. Martin writes: better to go ahead and use the message-id, rather than concoct yet another this time we mean it! unique identifier. st That's not the point. We're not going to impose this on st senders; I read the quote as meaning this time we mean it

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Jeff Breidenbach
There are three different parties coming to the table. One is the mail transfer agent of the sender, another is the list server, and the third is the archive server. Ideally, all three will be happy campers. So we just specify a header to put it in, and subscribers will be able to use it, per

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Dale Newfield
Jeff Breidenbach wrote: In addition, Barry was talking about concocting a unique identifier from the Date field and Message-ID. I'm not a big fan of this idea, because the date field comes from the mail user agent and is often wildly corrupt; e.g. coming from 100 years in the future. Oh--I

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Terri Oda
On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote: So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL. It is the archive server's job to decide what is the canonical URL for a message. There's a good chance these archival URLs

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Jeff Breidenbach
Regardless of whether we *need* to generate our own unique ID, I'm leaning towards the thought that we're going to *want* to generate our own for usability reasons. In a perfect world, I think we'd have a sequence number so I could visit http://example.com/mailman/archives/listname/204.html

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Barry Warsaw
On Jul 22, 2007, at 12:33 PM, Terri Oda wrote: On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote: I've looked at a few lurker archivers and I wasn't blown away by its user interface. That's apparently highly configurable though. I've been doing a

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Barry Warsaw
On Jul 24, 2007, at 2:02 AM, Jeff Breidenbach wrote: Which brings me to suggestion #2, which is go ahead and write an RFC on how list servers should embed archival links in messages. This sounds like an internet wide interoperability issue as much

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Barry Warsaw
On Jul 24, 2007, at 2:56 AM, Stephen J. Turnbull wrote: I simply think we should be prepared for applications where relying on the sender to supply a UUID is not acceptable; we need to be able to provide one ourselves. Creating UUIDs is a solved

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Barry Warsaw
On Jul 24, 2007, at 12:31 PM, Jeff Breidenbach wrote: What complexity? Mailman just does msg['X-List-Archive-Received-ID'] = Email.msgid() Easy to introduce, harder to deal with. The archival server would now keep track of both the message-id
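The one-liner quoted above is shorthand rather than a real API call. A rough equivalent using only the Python standard library might look like the sketch below. The helper name `add_archive_id` and the default domain `lists.example.org` are hypothetical, and `email.utils.make_msgid` stands in for the `Email.msgid()` placeholder.

```python
from email.message import Message
from email.utils import make_msgid


def add_archive_id(msg: Message,
                   header: str = "X-List-Archive-Received-ID",
                   domain: str = "lists.example.org") -> str:
    """Stamp a list-server-generated ID on the message unless one is present.

    Assumption: the header name is the one floated in the thread; it is a
    proposal under discussion, not an established standard.
    """
    if header not in msg:
        # make_msgid() returns a unique string of the form <...@domain>.
        msg[header] = make_msgid(domain=domain)
    return msg[header]


msg = Message()
msg["Subject"] = "Improving the archives"
first = add_archive_id(msg)
second = add_archive_id(msg)  # already stamped, so the value is unchanged
```

Adding the header is the easy part, as the reply notes: the archive server then has to keep track of this ID alongside the original Message-ID.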

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Stephen J. Turnbull
Jeff Breidenbach writes: So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL. It is the archive server's job to decide what is the canonical URL for a message. There's a good chance these archival URLs will be served by

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Jeff Breidenbach
What you gain from my proposal over a pure Message-ID approach is guaranteed uniqueness given the list copy Guarantee is a pretty strong word. A malicious person could post two messages with the same message-id, same date, but different bodies. Sometimes the channel between the MLM and the

Re: [Mailman-Developers] Improving the archives

2007-07-22 Thread Terri Oda
On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote: I've looked at a few lurker archivers and I wasn't blown away by its user interface. That's apparently highly configurable though. I've been doing a lot of thinking about interface, and I'm coming to the conclusion that something more like a web

Re: [Mailman-Developers] Improving the archives

2007-07-22 Thread Dale Newfield
Terri Oda wrote: I've been doing a lot of thinking about interface, and I'm coming to the conclusion that something more like a web bulletin board is probably the way to go For public lists, the answer may lie in external tools like nabble.com or mailinglistarchive.com Of course, that

Re: [Mailman-Developers] Improving the archives

2007-07-21 Thread A.M. Kuchling
On Fri, Jul 20, 2007 at 11:16:19AM -0400, Barry Warsaw wrote: Cool. I wonder if lurker is compatible with Python 2.5's mailbox.Maildir implementation and whether the two could share the maildirs. Thanks for the information! It had better be -- Maildir has a published specification. If

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
On Jul 4, 2007, at 3:30 PM, Jeff Breidenbach wrote: Maybe a way to think about this is that the canonical url is based on the message-id, but then there's some way to distill even this down to a tinyurl or simple integer that would be stable in

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
On Jul 4, 2007, at 1:16 PM, Dale Newfield wrote: Barry Warsaw wrote: Maybe a way to think about this is that the canonical url is based on the message-id, but then there's some way to distill even this down to a tinyurl or simple integer that

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
On Jul 8, 2007, at 1:06 AM, Paul Wise wrote: My personal opinion is that pipermail should be removed and mailman should not contain a default archiver since there are plenty of good archivers already (lurker, mhonarc etc). Adding wrappers around

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
On Jul 9, 2007, at 11:09 PM, Stephen J. Turnbull wrote: John A. Martin writes: In the absence of a Message-ID on an outgoing mail message many if not most MTAs will add one. Why not let Mailman anticipate the need to add a Message-ID when

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
On Jul 5, 2007, at 12:09 PM, John Dennis wrote: A little over a year ago I went on a search to find the best open source archiver and at that time I came up with Lurker (http://lurker.sourceforge.net) Since then I believe Lurker has seen a

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Stephen J. Turnbull
Barry Warsaw writes: First, I want to avoid talking about file system layout. To me, that's an implementation detail we needn't worry about right now. Agreed. How likely is it that two messages with the same message-id and date are /not/ duplicates? For message id generators that

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Stephen J. Turnbull
Barry Warsaw writes: Second, things can happen to a list that might cause this sequence number to get corrupted. Add an X-Mailman-Sequence-Number header if not already present. That doesn't deal with your other comments, but as I point out elsewhere, if you don't use *any*

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
On Jul 20, 2007, at 9:21 AM, Stephen J. Turnbull wrote: How likely is it that two messages with the same message-id and date are /not/ duplicates? For message id generators that include a time-stamp in the generated id, approximately the same as

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Nigel Metheringham
On 20 Jul 2007, at 13:39, Barry Warsaw wrote: I've looked at a few lurker archivers and I wasn't blown away by its user interface. That's apparently highly configurable though. I'd be inclined to agree wrt user interface. Documentation regarding this, and anything else to do with lurker,

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
On Jul 20, 2007, at 9:17 AM, Nigel Metheringham wrote: On 20 Jul 2007, at 13:39, Barry Warsaw wrote: I've looked at a few lurker archivers and I wasn't blown away by its user interface. That's apparently highly configurable though. I'd be

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
On Jul 20, 2007, at 9:31 AM, Stephen J. Turnbull wrote: Barry Warsaw writes: Second, things can happen to a list that might cause this sequence number to get corrupted. Add an X-Mailman-Sequence-Number header if not already present. That

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Nigel Metheringham
On 20 Jul 2007, at 15:26, Barry Warsaw wrote: BTW lurker gives all messages an ID which is 3 parts separated by periods. The first part is a date field - ie 20070720, the second part is the receive time, UTC, as 6 digits, and the final part is some form of hex id. The nice part is if you
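The three-part layout described here can be illustrated with a short Python sketch. This is not lurker's actual algorithm: the post only says the last part is "some form of hex id", so the truncated SHA-1 of the Message-ID used below is an assumption, as is the function name `lurker_style_id`.

```python
import hashlib
from datetime import datetime, timezone


def lurker_style_id(received: datetime, message_id: str) -> str:
    """Build an ID shaped like YYYYMMDD.HHMMSS.hexid from a receive time.

    Assumption: the hex part is the first 8 hex digits of a SHA-1 over the
    Message-ID; lurker's real derivation is not given in the thread.
    """
    utc = received.astimezone(timezone.utc)  # normalise the receive time to UTC
    hex_part = hashlib.sha1(message_id.encode("utf-8")).hexdigest()[:8]
    return f"{utc:%Y%m%d}.{utc:%H%M%S}.{hex_part}"


received = datetime(2007, 7, 20, 15, 26, 0, tzinfo=timezone.utc)
archive_id = lurker_style_id(received, "<12345@example.org>")
date_part, time_part, hex_part = archive_id.split(".")
```

Because the date and time fields are fixed-width and come first, IDs built this way sort chronologically as plain strings.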

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
Hi Nigel, On Jul 20, 2007, at 10:38 AM, Nigel Metheringham wrote: On 20 Jul 2007, at 15:26, Barry Warsaw wrote: BTW lurker gives all messages an ID which is 3 parts separated by periods. The first part is a date field - ie 20070720, the second

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Nigel Metheringham
On 20 Jul 2007, at 15:52, Barry Warsaw wrote: Mailman gets the From_ line before passing off to the archiver. But that's interesting, does lurker /require/ the From_ line? Well lurker handles Maildir - no From_ but the same info is in the filename, and it can take messages on stdin

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
On Jul 20, 2007, at 10:59 AM, Nigel Metheringham wrote: On 20 Jul 2007, at 15:52, Barry Warsaw wrote: Mailman gets the From_ line before passing off to the archiver. But that's interesting, does lurker /require/ the From_ line? Well lurker

Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Stephen J. Turnbull
Barry Warsaw writes: But it would have to be subject to the same bounce rules as any other auto-response which could be used as a spam vector, e.g. limit the number of bounces per time period and don't include the entire original message in the bounce But that prevents detecting a

Re: [Mailman-Developers] Improving the archives

2007-07-09 Thread Stephen J. Turnbull
John A. Martin writes: In the absence of a Message-ID on an outgoing mail message many if not most MTAs will add one. Why not let Mailman anticipate the need to add a Message-ID when archiving the message rather than leaving it to the outgoing MTA? Quite. My reason for saying last

Re: [Mailman-Developers] Improving the archives

2007-07-07 Thread Paul Wise
On 7/3/07, Terri Oda [EMAIL PROTECTED] wrote: I'm trying to remember all the things people have suggested for the archives in the past so I can figure out what needs to be done and what might be nice to have, and see if this is doable in the time I have in the foreseeable future. At

Re: [Mailman-Developers] Improving the archives

2007-07-05 Thread John Dennis
On Tue, 2007-07-03 at 20:05 -0400, Barry Warsaw wrote: On Jul 2, 2007, at 11:06 PM, Terri Oda wrote: Since I've largely finished up the coding contract that was eating up a lot of my time, I'm thinking that I'd like to do some coding for

Re: [Mailman-Developers] Improving the archives

2007-07-05 Thread Terri Oda
On 5-Jul-07, at 12:09 PM, John Dennis wrote: A little over a year ago I went on a search to find the best open source archiver and at that time I came up with Lurker (http://lurker.sourceforge.net) Since then I believe Lurker has seen a major new revision. I also believe Lurker is the

Re: [Mailman-Developers] Improving the archives

2007-07-04 Thread Stephen J. Turnbull
Barry Warsaw writes: - archive links that won't break if the archive is rebuilt Yes, this is absolutely critical, in fact, I'd put it right at the top of the list, even more so than a u/i overhaul. Stable urls, with backward compatible redirecting links if at all possible, would

Re: [Mailman-Developers] Improving the archives

2007-07-04 Thread John A. Martin
Stephen J Turnbull wrote on Wed, 04 Jul 2007 16:49:58 +0900: The main drawback to using Message IDs that I can see is that broken MUAs may supply no Message-ID, or the same one repeatedly. In the former case, as a last resort

Re: [Mailman-Developers] Improving the archives

2007-07-04 Thread Dale Newfield
I'm all for someone taking ownership of this long-neglected component -- thank you for doing so! Barry Warsaw wrote: Maybe a way to think about this is that the canonical url is based on the message-id, but then there's some way to distill even this down to a tinyurl or simple integer

Re: [Mailman-Developers] Improving the archives

2007-07-04 Thread Jeff Breidenbach
Maybe a way to think about this is that the canonical url is based on the message-id, but then there's some way to distill even this down to a tinyurl or simple integer that would be stable in the face of full archive regenerations. I'd suggest the reverse. Keep the canonical archive URL short

Re: [Mailman-Developers] Improving the archives

2007-07-04 Thread Jeff Breidenbach
In which case [the message body link] would be set to something like. http://third-party-service/[EMAIL PROTECTED] Just for fun, I did a trial implementation. It works, but the URLs are too long. For example, the URL below spends 59 characters on the message-id, and 27 characters on the listname.

Re: [Mailman-Developers] Improving the archives

2007-07-03 Thread Steve Huston
I'll admit to not having read previous discussions on this topic, but I'll also add my 2 [insert-lowest-denomination-coin here]: On 7/2/07 11:06 PM, Terri Oda wrote: - better address obfuscation (maybe by generating pages through cgi) I run a few Wordpress sites, and there's a plugin I use called

Re: [Mailman-Developers] Improving the archives

2007-07-03 Thread Barry Warsaw
On Jul 2, 2007, at 11:06 PM, Terri Oda wrote: Since I've largely finished up the coding contract that was eating up a lot of my time, I'm thinking that I'd like to do some coding for fun. And nothing says fun like trying to fix the Mailman

Re: [Mailman-Developers] Improving the archives

2007-07-03 Thread Barry Warsaw
Steve makes me think of a couple of other wish list items. On Jul 3, 2007, at 7:36 AM, Steve Huston wrote: On 7/2/07 11:06 PM, Terri Oda wrote: - better address obfuscation (maybe by generating pages through cgi) I run a few Wordpress sites, and