-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 25, 2007, at 12:47 AM, Jeff Breidenbach wrote:

>> What you gain from my proposal over a pure Message-ID approach
>> is guaranteed uniqueness given the list copy
>
> Guarantee is a pretty strong word. A malicious person could post two
> messages with the same message-id, same date, but different bodies.

No question, if the archive service and the list server are not
intimately connected, the communication channel between the two can be
subverted. There are ways that channel's trust could be enhanced,
though, for example by the list server signing its headers in a
DKIM-like fashion. But in situations where the two are co-located, you
can trust these headers even without that enhancement.

> So that moves us to how many collisions are reduced in practice.
> I have a question about the numbers Barry mined from the python
> lists. Are the collisions really that high? One should not count
> messages without a message-id, because the MLM can and should
> create one in that case.

I've uploaded the script I used here:

http://wiki.list.org/download/attachments/786633/scan.py?version=1

It's probably not perfect, and certainly the python.org mboxes may not
be representative enough of the real world. Please grab the script,
tweak it and run it over your own raw archives; it should be easily
modified to handle any of the mailbox formats supported by Python
2.5's mailbox module. If you improve the script or find numbers that
lead to different conclusions, now's the time to know!

>> and human friendlier urls.
>
> That's a very compelling point.
>
> SHA1 can't be computed inside someone's head or simple cut-n-pasted
> together for old messages, but I think the usability benefits of
> short URLs (short enough that they can comfortably fit inside
> message bodies) outweighs this drawback. By the way, is SHA-1 still
> in favor? My impression was it was fading away after the Shandong
> University team partially cracked it.

We're not concerned with the cryptographic security claims of SHA1.
I don't see any economically beneficial attack on the archives against
SHA1 here. I think SHA1 is reasonably universally available, and
marginally better than MD5, so it's probably good enough for this
application.

You're right that no one is going to do SHA1 in their heads, and if
they could, they're probably working for some TLA in a secret gubmit
basement lab somewhere. The point of course is that a /program/ could
easily apply the algorithm to a very minimal existing message and come
up with the same canonical url. This enables all kinds of cool
applications based on REST-y principles or whatever.

The fact that the algorithm leads to short(ish), largely unambiguous
(to humans), readable urls is an important benefit -- probably /the/
most important benefit.

>> Throw it away or hide [Date]? The former would be a problem,
>> but not the latter.
>
> Thrown away.

Really? Wow. I'd have thought every archiving service would want to
keep a record of the raw message it received on the wire. That would
allow it to regenerate the html archive if necessary, provide useful
forensics, and allow for exactly the kind of data mining we're doing
here. I can't see /any/ reason for not saving the raw messages in
their entirety, especially for a public list. Maybe for a private one,
where your data retention policies require you delete things after a
certain amount of time, but even there, I can't see why you'd want to
trim raw messages rather than just chucking them entirely.

> My favorite archival service is based on mhonarc,
> and raw mail goes into offline cold storage.

What's the advantage of that? Isn't disk space cheap as dirt? Probably
cheaper if you've bought any topsoil recently :).

Still, the raw messages are still available right? So if there was
enough value in calculating the canonical urls so that the archive
service could be seen as an interoperability good citizen, then it
could be done.
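
To make the point about a program recomputing the same key from a very
minimal message concrete, here is a minimal sketch in Python. It is
not the proposal's actual algorithm, which this message does not spell
out. The way the headers are joined, the base32 encoding, the function
names canonical_key and count_collisions, and the example.com messages
are all assumptions made for illustration.

```python
import base64
import hashlib
from collections import Counter
from email import message_from_string


def canonical_key(msg, use_date=True):
    """Return a short key derived only from a message's headers.

    Assumption: the key is the base32-encoded SHA-1 digest of the
    Message-ID, optionally followed by the Date header. The real
    proposal may combine or encode these differently.
    """
    parts = [(msg["Message-ID"] or "").strip()]
    if use_date:
        parts.append((msg["Date"] or "").strip())
    digest = hashlib.sha1("\n".join(parts).encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii")


def count_collisions(messages, use_date=True):
    """Count messages whose key duplicates that of another message."""
    counts = Counter(canonical_key(m, use_date) for m in messages)
    return sum(n - 1 for n in counts.values() if n > 1)


# Two hypothetical messages sharing a Message-ID but not a Date.
raw = [
    "Message-ID: <a@example.com>\n"
    "Date: Wed, 25 Jul 2007 00:47:00 -0700\n\nfirst body\n",
    "Message-ID: <a@example.com>\n"
    "Date: Thu, 26 Jul 2007 09:00:00 -0700\n\nsecond body\n",
]
msgs = [message_from_string(r) for r in raw]

print(canonical_key(msgs[0]))
print(count_collisions(msgs, use_date=False))  # 1
print(count_collisions(msgs, use_date=True))   # 0
```

Only the headers feed the hash, so anyone holding a copy of the
message can recompute the key without the archive's help. For the same
reason, two messages with identical Message-ID and Date but different
bodies would still collide in this sketch.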

I'll just reiterate that I'm not married to including the Date header
in the algorithm. Until proven otherwise by more research, I think
it's a good idea to use because 1) it's required by RFC 2822 and 2) it
seems to reduce collisions. I think the algorithm I propose would work
just as well with Message-IDs alone, although there's more of a chance
that the non-sequence numbered url will return multiple matches.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqdRVnEjvBPtnXfVAQJiOgP/UIufdisvgVPV3qKo4dV2bfWoUPcp/dIQ
iGj9faWXFwa/NoOk3HtIZbu7JVrJEY2t9nihJX6lEjZ1Q6AFH1hkObx0dV5NRfj2
KjRANxU6UsBvpDCzBQWthX1d7HviRJ74Pio5hVti+0YoV4pjq8UHaxTlrECHmkad
ERlOYR2onAQ=
=8b8I
-----END PGP SIGNATURE-----
_______________________________________________
Mailman-Developers mailing list
[email protected]
http://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://www.python.org/cgi-bin/faqw-mm.py
Searchable Archives: http://www.mail-archive.com/mailman-developers%40python.org/
Unsubscribe: http://mail.python.org/mailman/options/mailman-developers/archive%40jab.org
Security Policy: http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq01.027.htp
