> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of J. van Baardwijk
[snip]
> 1. When people who receive the Digest reply to a post, their messages often
> do not have the subject header "Re: <subject>" but the subject header "Re:
> Brin-L Digest ####". Although the headers differ, the posts are essentially
> part of the same thread.
A subject that references the digest is at least an unambiguous signal that
I need to figure out which thread the message belongs to. Sometimes I'd
expect that to be impossible, and the message will end up as a singleton.
However, as with any other message, parsing some of the quoted text
(assuming there is any, of course) should often be enough to identify its
parent.
The next step would be to use some feature vector extraction to try to match
messages to threads. Feature vectors are essentially derived keywords; the
more of them a pair of documents has in common, the more closely related the
documents are likely to be. I'm actually more interested in feature vectors
for essentially the opposite reason -- to identify when a thread's topic
diverges significantly.
However, the kinds of analysis described above are not going to scale very
well. I'm going to have to pick and choose when to dig deep.
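A minimal sketch of the keyword-overlap idea, assuming plain word sets stand
in for real feature vectors. The stop-word list, the minimum word length, and
the function names are hypothetical, chosen only for illustration; this is
not the extraction method actually used.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it",
              "that", "for", "on"}

def keywords(text):
    """Return the set of lowercase words in text, minus stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOP_WORDS and len(w) > 2}

def similarity(text_a, text_b):
    """Jaccard overlap of the two keyword sets, from 0.0 to 1.0."""
    a, b = keywords(text_a), keywords(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A high score between a message and the earlier messages of a thread suggests
it belongs there; a falling score along a thread is one possible sign that
the topic has drifted.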
> 2. When scanning for replies, you may look at subject headers that start
> with "Re:". However, people who receive the Digest and actually bother to
> change the subject header will often simply copy & paste the header from
> the original post. This results in subject headers that do not start with
> "Re:" and are therefore not recognisable as replies. Rather, the scanning
> software will interpret such posts as the first message in a new thread.
I completely ignore "Re:" and its variants. They get stripped out so that I
can query on subjects without having to deal with all the variants.
Instead, I rely on timestamps and the methods above for figuring out the
sequence of messages. The timestamp from the originating system may be
bogus (lots of people have computers with clocks set wrong or apparently not
working -- you can spot Mac users by their 1904 time stamps sometimes).
However, the timestamps of the mail servers along the way are usually quite
accurate; the only real trick is converting (too dang many) time formats to
GMT. If resolution down to seconds were needed, this probably would fail
fairly often.
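A rough sketch of pulling a server timestamp out of a message and converting
it to GMT, using Python's standard email package. Taking the first parseable
"Received" header (the most recent hop) is an assumption made here, as is the
function name; the text above does not say which hop to trust.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

def server_time_utc(raw_message):
    """Return a UTC datetime from the first parseable Received header,
    or None if no header carries a usable date."""
    msg = message_from_string(raw_message)
    for received in msg.get_all("Received", []):
        # The date follows the last semicolon in a Received header.
        _, sep, date_part = received.rpartition(";")
        if not sep:
            continue
        try:
            dt = parsedate_to_datetime(date_part.strip())
        except (TypeError, ValueError):
            continue  # unparseable date format; try the next hop
        if dt.tzinfo is None:
            # No usable zone information; assume the value is already UTC.
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc)
    return None
```

The sender's own "Date" header is deliberately ignored here, since that is
the one most likely to carry a bogus clock value.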
> 3. What happens when someone starts a new thread and uses a subject header
> that has been used before? Let's say that someone starts a thread "Uplift
> Universe" at one point in time, and a year later someone starts a new
> thread with the exact same header. Given the time between the two threads,
> they are clearly separate threads; but will your program recognise them as
> such?
Yes, but I'm glad you reminded me. That was part of an earlier
implementation, but I don't think I remembered to specify it in what I'm
doing now. However, showing the multiple "clusters" of messages around a
subject is often interesting.
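One way to separate same-subject threads by time, sketched under the
assumption that a long silence marks a new thread. The 30-day threshold and
the function name are hypothetical; the earlier implementation mentioned
above may have worked differently.

```python
from datetime import timedelta

def cluster_by_gap(timestamps, max_gap=timedelta(days=30)):
    """Group datetimes into clusters, starting a new cluster whenever
    the gap since the previous message exceeds max_gap."""
    clusters = []
    for ts in sorted(timestamps):
        if clusters and ts - clusters[-1][-1] <= max_gap:
            clusters[-1].append(ts)
        else:
            clusters.append([ts])
    return clusters
```

Given the timestamps of every message sharing a normalized subject, each
returned cluster would be treated as its own thread.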
> 4. The abbreviation "Re:" appears in four different forms: "Re:
> <subject>", "RE: <subject>" (with a capital E), and both versions also
> without a blank between the colon and the title. The solution to this
> particular problem should be obvious...
It can be even worse, as I recall. I'm going to have to dig into some old
code. For what it's worth, I was doing this stuff in 1994, starting with
the original mailing lists in which the WWW was defined and refined.
Imagine this series of subjects:
Uplift War
Re: Uplift Ware
Re: Re: Uplift War
Football (was Re: Uplift War)
Football (was Re: Re: Uplift War)
This kind of weirdness happens when people cut and paste or re-type
subjects. In this example, an extra leading space fouled things up. Then
two people tried to change the subject the same way, but variants snuck in.
On the other hand, this seems to happen far less often these days, so it may
just be a brain exercise.
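A sketch of subject normalization covering the cases above: repeated or
oddly cased "Re:" prefixes, stray whitespace, and a trailing "(was ...)"
note. The regular expressions and function names are illustrative
assumptions, not the old code referred to above.

```python
import re

# One or more "Re:" prefixes, any case, with or without spaces.
RE_PREFIX = re.compile(r"^\s*(re\s*:\s*)+", re.IGNORECASE)
# A trailing "(was ...)" note recording the previous subject.
WAS_SUFFIX = re.compile(r"\(\s*was:?\s*(.*?)\s*\)\s*$", re.IGNORECASE)

def clean(subject):
    """Strip Re: prefixes, collapse whitespace, and lowercase."""
    subject = RE_PREFIX.sub("", subject)
    return " ".join(subject.split()).lower()

def normalize_subject(subject):
    """Return (current_topic, previous_topic_or_None)."""
    previous = None
    match = WAS_SUFFIX.search(subject)
    if match:
        previous = clean(match.group(1))
        subject = subject[:match.start()]
    return clean(subject), previous
```

For example, normalize_subject("Football (was Re: Re: Uplift War)") yields
("football", "uplift war"). A retyped variant such as "Uplift Ware" would
still slip through and need some other kind of matching.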
Nick