but if you can trust yourself to generate them, consecutive
integers provide minimal, order-preserving, perfect hashing, too!
Hmm, this sounds pretty sensible to me.
Jeff
___
Mailman-Developers mailing list
Mailman-Developers@python.org
Craig Loomis writes:
Globally unique IDs, hashed IDs, etc., are very appealing from
various CS-y and techie points of view, but are simply not memorable
to humans or knowable by dumb external programs. I think as much, or
more, effort should be put into delivering a
Or Re: [Mailman-Developers 10417] Improving the archives
I would like to interject and highlight some use cases for stable
and predictable IDs. For us, message IDs are directly used both by
people and ignorant programs. Our mailing lists serve as a permanent
and concise record of our
Question: what about crossposted messages?
Let's say a message gets sent to a list called mailman-developers
with a CC to a list called pet-bunnies. Hypothetically, of course.
Presumably, the person who got the message from pet-bunnies
should probably end up at the pet-bunnies archive, where the
--On 2 October 2007 22:47:35 -0400 Barry Warsaw [EMAIL PROTECTED] wrote:
One question: should the angle brackets on the Message-ID be part of
the hash or not? I think they should, or IOW, the entire value of
the Message-ID header is taken as the hash, though they should be
stripped off if
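
A minimal sketch of the question above, assuming SHA-1 as the hash and base32 as the text encoding (the thread mentions base32 examples on a wiki page but does not show them here, so both choices are assumptions). The function name and the sample Message-ID are hypothetical. It only illustrates that keeping or stripping the angle brackets yields two different identifiers, so the choice has to be made once and kept.

```python
import base64
import hashlib


def message_id_hash(message_id: str, strip_brackets: bool = False) -> str:
    """Return a base32-encoded SHA-1 digest of a Message-ID header value.

    Hypothetical helper: the hash function and encoding are assumptions,
    not something the thread has settled on.
    """
    value = message_id.strip()
    if strip_brackets and value.startswith("<") and value.endswith(">"):
        value = value[1:-1]
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    # A 20-byte SHA-1 digest encodes to exactly 32 base32 characters.
    return base64.b32encode(digest).decode("ascii")


# Made-up Message-ID used purely for illustration.
with_brackets = message_id_hash("<example@example.com>")
without_brackets = message_id_hash("<example@example.com>", strip_brackets=True)
print(with_brackets)
print(without_brackets)
```

The two printed values differ, which is why an archiver and a list server would need to agree on one convention.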
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Aug 8, 2007, at 1:04 AM, Dale Newfield wrote:
Jeff Breidenbach wrote:
5.85 million messages
That's 0.03% if you count all the messages. It is 0.008% if you
discard the top three offenders, all of which I have contacted.
I'd say that's a
What we really want to know is how many (non-empty) Message-ID
collisions are there that *don't* share a Date? This is the number of
messages that only-messageid loses, and that the composite identifier
method would not lose.
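
One way to read the question above as code, offered as a sketch rather than the script actually used in the thread: group messages by non-empty Message-ID, then count how many extra distinct Date values each ID carries. The function name and the sample data are hypothetical, and "loses" is interpreted here as "collapsed into another message when keyed on Message-ID alone".

```python
from collections import defaultdict


def count_lost_messages(messages):
    """Count messages that a Message-ID-only key would collapse but a
    (Message-ID, Date) composite key would keep apart.

    `messages` is an iterable of (message_id, date) string pairs.
    """
    dates_by_id = defaultdict(set)
    for message_id, date in messages:
        if not message_id:
            continue  # empty Message-IDs are excluded, as in the question
        dates_by_id[message_id].add(date)
    # Each ID keeps one message under the ID-only scheme; every additional
    # distinct date is a message the composite key would have preserved.
    return sum(len(dates) - 1 for dates in dates_by_id.values())


# Made-up sample: "<a@x>" appears with two different dates.
sample = [
    ("<a@x>", "Mon, 1 Jan 2007 00:00:00 +0000"),
    ("<a@x>", "Mon, 1 Jan 2007 00:00:00 +0000"),
    ("<a@x>", "Tue, 2 Jan 2007 00:00:00 +0000"),
    ("<b@x>", "Wed, 3 Jan 2007 00:00:00 +0000"),
    ("", "Thu, 4 Jan 2007 00:00:00 +0000"),
]
lost = count_lost_messages(sample)
print(lost)
```

Exact repeats of the same ID and date are treated as true duplicates and do not count as lost.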
I took a look at a larger dataset, 5.85 million messages from
Jeff Breidenbach wrote:
5.85 million messages
That's 0.03% if you count all the messages. It is 0.008% if you
discard the top three offenders, all of which I have contacted.
I'd say that's a strong argument for just using the Message-ID and
simplifying this tremendously...
...Barry, do you
704 messages fall into this category. Of these, 596 come from a
single (malfunctioning and duplicate spewing) list server. I have
not yet examined the remaining 208 messages, but I'll bet anything
many also have duplicate message bodies. Or are spam. So for this
data set, we have an upper
What we really want to know is how many (non-empty) Message-ID
collisions are there that *don't* share a Date? This is the number of
messages that only-messageid loses, and that the composite identifier
method would not lose.
It took longer than expected, but I now have numbers from
looking
Jeff Breidenbach wrote:
So I just looked at 2 million raw messages from 2007, spread over
a few thousand mailing lists (all data is from mail-archive.com). My
first question was - when comparing only with messages from the
same list - how many times do I see a repeated message-id? The
answer
If you improve the script or find numbers that lead to different
conclusions, now's the time to know!
Live and learn!
So I just looked at 2 million raw messages from 2007, spread over
a few thousand mailing lists (all data is from mail-archive.com). My
first question was - when comparing only
If you are relying on the sender to do the right thing, then
why not force them to create proper message-ids?
I think Barry's proposal is essentially a numbers game - e.g.
he's hoping for significantly better results using Date in
the calculation than not using it.
On Jul 24, 2007, at 1:11 PM, Terri Oda wrote:
On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote:
So we just specify a header to put it in, and subscribers will be able
to use it, per definition of a canonical URL.
It is the archive server's job
On Jul 24, 2007, at 2:03 PM, Jeff Breidenbach wrote:
Regardless of whether we *need* to generate our own unique ID, I'm
leaning towards the thought that we're going to *want* to generate
our own for usability reasons. In a perfect world, I think
On Jul 24, 2007, at 11:04 PM, Stephen J. Turnbull wrote:
So we just specify a header to put it in, and subscribers will be able
to use it, per definition of a canonical URL.
It is the archive server's job to decide what is the canonical URL
On Jul 25, 2007, at 12:47 AM, Jeff Breidenbach wrote:
What you gain from my proposal over a pure Message-ID approach
is guaranteed uniqueness given the list copy
Guarantee is a pretty strong word. A malicious person could post two
messages with
Guarantee is a pretty strong word. A malicious person could post two
messages with the same message-id, same date, but different bodies.
This is my concern too. Especially since this is known information; it is
trivial to be malicious. Whatever was done, I think would *have* to deal
with
Hi,
I think this is the first time that I'm posting here but hopefully
not the last. Thanks to everyone involved for an incredible project.
I'm not much of a developer but I like practical solutions and will
do everything possible to help improve in this area even if it's
just to give some
Barry Warsaw writes:
I agree, I just don't think message-ids are user friendly enough to
be this canonical url. Especially in this context, which is exactly
where urls are thrown in users faces. An archiving service is
exactly the right place for redirecting human readable urls to
Barry Warsaw writes:
Yes, definitely. What do you think of the base32 examples I have on
the wiki page?
They're somewhat better than Message-IDs for readability, but they're
not user-friendly.
On Jul 24, 2007, at 1:11 PM, Terri Oda wrote:
It seems silly to generate nice short
Notice that of 325146 total messages, 624 of them had no message-id
header. Even if you aggregate dup+col, you're still looking at a
total duplicate rate of 0.29%.
Message ID's are supposed to be unique. This is discussed in
RFC 822: 4.6.1 and RFC 1036: 2.1.5, and probably other places.
If
Jeff Breidenbach writes:
Notice that of 325146 total messages, 624 of them had no message-id
header. Even if you aggregate dup+col, you're still looking at a
total duplicate rate of 0.29%.
Message ID's are supposed to be unique.
Fortunately, a rule more honored in the observance
st == Stephen J Turnbull
Re: [Mailman-Developers] Improving the archives
Tue, 24 Jul 2007 15:56:35 +0900
st Jeff Breidenbach writes:
Notice that of 325146 total messages, 624 of them had no
message-id header. Even if you aggregate dup+col, you're
still looking
John A. Martin writes:
better to go ahead and use the message-id, rather than concoct
yet another "this time we mean it!" unique identifier.
st That's not the point. We're not going to impose this on
st senders;
I read the quote as meaning this time we mean it
There are three different parties coming to the table. One is
the mail transfer agent of the sender, another is the list server,
and the third is the archive server. Ideally, all three will be happy
campers.
So we just specify a header to put it in, and subscribers will be able
to use it, per
Jeff Breidenbach wrote:
In addition, Barry was talking about concocting a unique
identifier from the Date field and Message-ID. I'm not a big fan of
this idea, because the date field comes from the mail user agent
and is often wildly corrupt; e.g., coming from 100 years in the future.
Oh--I
On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote:
So we just specify a header to put it in, and subscribers will be able
to use it, per definition of a canonical URL.
It is the archive server's job to decide what is the canonical URL
for a message. There's a good chance these archival URLs
Regardless of whether we *need* to generate our own unique ID, I'm
leaning towards the thought that we're going to *want* to generate
our own for usability reasons. In a perfect world, I think we'd have
a sequence number so I could visit
http://example.com/mailman/archives/listname/204.html
On Jul 22, 2007, at 12:33 PM, Terri Oda wrote:
On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:
I've looked at a few lurker archivers and I wasn't blown away by its
user interface. That's apparently highly configurable though.
I've been doing a
On Jul 24, 2007, at 2:02 AM, Jeff Breidenbach wrote:
Which brings me to suggestion #2, which is go ahead and write
an RFC on how list servers should embed archival links in messages.
This sounds like an internet wide interoperability issue as much
On Jul 24, 2007, at 2:56 AM, Stephen J. Turnbull wrote:
I simply think we should be prepared for applications where relying on
the sender to supply a UUID is not acceptable; we need to be able to
provide one ourselves. Creating UUIDs is a solved
On Jul 24, 2007, at 12:31 PM, Jeff Breidenbach wrote:
What complexity? Mailman just does
msg['X-List-Archive-Received-ID'] = Email.msgid()
Easy to introduce, harder to deal with. The archival server would now
keep track of both the message-id
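
For readers who want to try the one-liner quoted above, here is a runnable sketch using the Python standard library. `Email.msgid()` in the quote is shorthand; `email.utils.make_msgid()` stands in for it here as an assumption. The `stamp` function name and the `lists.example.org` domain are hypothetical. The sketch only adds the header when it is absent, so a message that passes through twice keeps its first identifier.

```python
from email.message import Message
from email.utils import make_msgid

# Header name taken from the quoted proposal.
HEADER = "X-List-Archive-Received-ID"


def stamp(msg: Message) -> str:
    """Add a list-generated unique ID header if the message lacks one.

    Returns the value of the header after stamping.
    """
    if msg[HEADER] is None:
        # The domain is a placeholder; a real list server would use its own.
        msg[HEADER] = make_msgid(domain="lists.example.org")
    return msg[HEADER]


msg = Message()
msg["Message-ID"] = "<sender-supplied@example.com>"
print(stamp(msg))
```

As the reply notes, the archive server would then have two identifiers per message to keep track of, which this sketch does not address.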
Jeff Breidenbach writes:
So we just specify a header to put it in, and subscribers will be able
to use it, per definition of a canonical URL.
It is the archive server's job to decide what is the canonical URL
for a message. There's a good chance these archival URLs will be
served by
What you gain from my proposal over a pure Message-ID approach
is guaranteed uniqueness given the list copy
Guarantee is a pretty strong word. A malicious person could post two
messages with the same message-id, same date, but different bodies.
Sometimes the channel between the MLM and the
On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:
I've looked at a few lurker archivers and I wasn't blown away by its
user interface. That's apparently highly configurable though.
I've been doing a lot of thinking about interface, and I'm coming to
the conclusion that something more like a web
Terri Oda wrote:
I've been doing a lot of thinking about interface, and I'm coming to
the conclusion that something more like a web bulletin board is
probably the way to go
For public lists, the answer may lie in external tools like nabble.com
or mailinglistarchive.com
Of course, that
On Fri, Jul 20, 2007 at 11:16:19AM -0400, Barry Warsaw wrote:
Cool. I wonder if lurker is compatible with Python 2.5's
mailbox.Maildir implementation and whether the two could share the
maildirs. Thanks for the information!
It had better be -- Maildir has a published specification. If
On Jul 4, 2007, at 3:30 PM, Jeff Breidenbach wrote:
Maybe a way to think about this is that the canonical url is based on
the message-id, but then there's some way to distill even this down
to a tinyurl or simple integer that would be stable in
On Jul 4, 2007, at 1:16 PM, Dale Newfield wrote:
Barry Warsaw wrote:
Maybe a way to think about this is that the canonical url is based on
the message-id, but then there's some way to distill even this down
to a tinyurl or simple integer that
On Jul 8, 2007, at 1:06 AM, Paul Wise wrote:
My personal opinion is that pipermail should be removed and mailman
should not contain a default archiver since there are plenty of good
archivers already (lurker, mhonarc etc). Adding wrappers around
On Jul 9, 2007, at 11:09 PM, Stephen J. Turnbull wrote:
John A. Martin writes:
In the absence of a Message-ID
on an outgoing mail message many if not most MTAs will add one. Why
not let Mailman anticipate the need to add a Message-ID when
On Jul 5, 2007, at 12:09 PM, John Dennis wrote:
A little over a year ago I went on a search to find the best open source
archiver and at that time I came up with Lurker
(http://lurker.sourceforge.net) Since then I believe Lurker has seen a
Barry Warsaw writes:
First, I want to avoid talking about file system layout. To me,
that's an implementation detail we needn't worry about right now.
Agreed.
How likely is it that two messages with the same message-id and
date are /not/ duplicates?
For message id generators that
Barry Warsaw writes:
Second, things can happen to a list
that might cause this sequence number to get corrupted.
Add an X-Mailman-Sequence-Number header if not already present.
That doesn't deal with your other comments, but as I point out
elsewhere, if you don't use *any*
On Jul 20, 2007, at 9:21 AM, Stephen J. Turnbull wrote:
How likely is it that two messages with the same message-id and
date are /not/ duplicates?
For message id generators that include a time-stamp in the generated
id, approximately the same as
On 20 Jul 2007, at 13:39, Barry Warsaw wrote:
I've looked at a few lurker archivers and I wasn't blown away by its
user interface. That's apparently highly configurable though.
I'd be inclined to agree wrt user interface. Documentation regarding
this, and anything else to do with lurker,
On Jul 20, 2007, at 9:17 AM, Nigel Metheringham wrote:
On 20 Jul 2007, at 13:39, Barry Warsaw wrote:
I've looked at a few lurker archivers and I wasn't blown away by its
user interface. That's apparently highly configurable though.
I'd be
On Jul 20, 2007, at 9:31 AM, Stephen J. Turnbull wrote:
Barry Warsaw writes:
Second, things can happen to a list
that might cause this sequence number to get corrupted.
Add an X-Mailman-Sequence-Number header if not already present.
That
On 20 Jul 2007, at 15:26, Barry Warsaw wrote:
BTW lurker gives all messages an ID which is 3 parts separated by
periods. The first part is a date field - ie 20070720, the second
part is the receive time, UTC, as 6 digits, and the final part
is some form of hex id. The nice part is if you
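
The three-part layout described above can be illustrated with a small parser. This is a sketch based only on the description in this message (date as YYYYMMDD, receive time in UTC as six digits, then a hex component); it is not taken from lurker's source, and the function name and sample ID are made up.

```python
from datetime import datetime, timezone


def parse_lurker_style_id(identifier: str):
    """Split a 'YYYYMMDD.HHMMSS.hex' identifier into a UTC datetime and
    the trailing hex component.

    Raises ValueError if the identifier does not have three parts, the
    date or time is malformed, or the last part is not hexadecimal.
    """
    date_part, time_part, hex_part = identifier.split(".")
    received = datetime.strptime(
        date_part + time_part, "%Y%m%d%H%M%S"
    ).replace(tzinfo=timezone.utc)
    int(hex_part, 16)  # validation only; the meaning of the hex part is unknown
    return received, hex_part


# Hypothetical ID following the described shape.
received, suffix = parse_lurker_style_id("20070720.142600.0123abcd")
print(received.isoformat(), suffix)
```

Because the date and time come first and are fixed width, sorting such IDs as plain strings also sorts them by receive time.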
Hi Nigel,
On Jul 20, 2007, at 10:38 AM, Nigel Metheringham wrote:
On 20 Jul 2007, at 15:26, Barry Warsaw wrote:
BTW lurker gives all messages an ID which is 3 parts separated by
periods. The first part is a date field - ie 20070720, the second
On 20 Jul 2007, at 15:52, Barry Warsaw wrote:
Mailman gets the From_ line before passing off to the archiver.
But that's interesting, does lurker /require/ the From_ line?
Well lurker handles Maildir - no From_ but the same info is in the
filename, and it can take messages on stdin
On Jul 20, 2007, at 10:59 AM, Nigel Metheringham wrote:
On 20 Jul 2007, at 15:52, Barry Warsaw wrote:
Mailman gets the From_ line before passing off to the archiver.
But that's interesting, does lurker /require/ the From_ line?
Well lurker
Barry Warsaw writes:
But it would have to be subject to the same bounce rules as any other
auto-response which could be used as a spam vector, e.g. limit the
number of bounces per time period and don't include the entire
original message in the bounce
But that prevents detecting a
John A. Martin writes:
In the absence of a Message-ID
on an outgoing mail message many if not most MTAs will add one. Why
not let Mailman anticipate the need to add a Message-ID when archiving
the message rather than leaving it to the outgoing MTA?
Quite.
My reason for saying last
On 7/3/07, Terri Oda [EMAIL PROTECTED] wrote:
I'm trying to remember all the things people have suggested for the
archives in the past so I can figure out what needs to be done and
what might be nice to have, and see if this is doable in the time I
have in the foreseeable future.
At
On Tue, 2007-07-03 at 20:05 -0400, Barry Warsaw wrote:
On Jul 2, 2007, at 11:06 PM, Terri Oda wrote:
Since I've largely finished up the coding contract that was eating up
a lot of my time, I'm thinking that I'd like to do some coding for
On 5-Jul-07, at 12:09 PM, John Dennis wrote:
A little over a year ago I went on a search to find the best open source
archiver and at that time I came up with Lurker
(http://lurker.sourceforge.net) Since then I believe Lurker has seen a
major new revision. I also believe Lurker is the
Barry Warsaw writes:
- archive links that won't break if the archive is rebuilt
Yes, this is absolutely critical, in fact, I'd put it right at the
top of the list, even more so than a u/i overhaul. Stable urls, with
backward compatible redirecting links if at all possible, would
st == Stephen J Turnbull
Re: [Mailman-Developers] Improving the archives
Wed, 04 Jul 2007 16:49:58 +0900
st The main drawback to using Message IDs that I can see is that
st broken MUAs may supply no Message-ID, or the same one
st repeatedly. In the former case, as a last resort
I'm all for someone taking ownership of this long-neglected component --
thank you for doing so!
Barry Warsaw wrote:
Maybe a way to think about this is that the canonical url is based on
the message-id, but then there's some way to distill even this down
to a tinyurl or simple integer
Maybe a way to think about this is that the canonical url is based on
the message-id, but then there's some way to distill even this down
to a tinyurl or simple integer that would be stable in the face of
full archive regenerations.
I'd suggest the reverse. Keep the canonical archive URL short
In which case [the message body link] would be set to something like.
http://third-party-service/[EMAIL PROTECTED]
Just for fun, I did a trial implementation. It works, but the URLs are
too long.
For example, the URL below spends 59 characters on the message-id, and
27 characters on the listname.
I'll admit to not having read previous discussions on this topic, but
I'll also add my 2 [insert-lowest-denomination-coin] here:
On 7/2/07 11:06 PM, Terri Oda wrote:
- better address obfuscation (maybe by generating pages through cgi)
I run a few Wordpress sites, and there's a plugin I use called
On Jul 2, 2007, at 11:06 PM, Terri Oda wrote:
Since I've largely finished up the coding contract that was eating up
a lot of my time, I'm thinking that I'd like to do some coding for
fun. And nothing says fun like trying to fix the Mailman
Steve makes me think of a couple of other wish list items.
On Jul 3, 2007, at 7:36 AM, Steve Huston wrote:
On 7/2/07 11:06 PM, Terri Oda wrote:
- better address obfuscation (maybe by generating pages through cgi)
I run a few Wordpress sites, and
66 matches