On Fri, 18 Dec 2009 19:02:21 +0000, James Westby <jw+deb...@jameswestby.net> wrote:
> I like the architecture of notmuch, and have just switched
> to using it as my primary client, so thanks.

You're quite welcome, James. Welcome to notmuch!

> Therefore I'd like to fix this. The obvious way is to
> introduce documents into the db for each id we see, and
> threading should then naturally work better.

That sounds like a fine idea.

> The only issue I see with doing this is with mail delays.
> Once we do this we will sometimes receive a message that
> already has a dummy document. What happens currently with
> message-id collisions?

The current message-ID collision logic is pretty brain-dead. It just
says "Oh, I've seen a file with this message ID before, so I'll skip
this additional file".

But I'm just putting the finishing touches on a patch that instead does:

        Oh, and here's an additional filename for that message ID. Add
        that too, please.
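
A rough sketch of that behaviour, in Python rather than notmuch's C, with
a plain dict standing in for the database. The names add_file and
filenames_by_id are made up for illustration and are not notmuch API:

```python
from collections import defaultdict

# Hypothetical stand-in for the database: message ID -> list of filenames.
filenames_by_id = defaultdict(list)


def add_file(message_id, filename):
    """Record filename under message_id; return True if the ID was new."""
    is_new = message_id not in filenames_by_id
    # Instead of skipping a file whose message ID is already known,
    # remember it as one more filename for the same message.
    if filename not in filenames_by_id[message_id]:
        filenames_by_id[message_id].append(filename)
    return is_new
```

A second file with a known message ID then ends up alongside the first
one instead of being dropped.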

Beyond that, all we would need to do is also index the new content. I
don't want to do useless re-indexing when files just get renamed. So
maybe all we need to do is save the filesize of the last-indexed file
for a document, and then, when we encounter a file with the same
message ID and a larger file size, index it as well?
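
As a sketch only, that rule might look like the following in Python. The
dict indexed_size and the function should_index are hypothetical names,
and the real thing would store the size on the database document:

```python
# Hypothetical stand-in for per-document metadata: message ID -> size in
# bytes of the last-indexed file for that message.
indexed_size = {}


def should_index(message_id, file_size):
    """Return True if a file of file_size bytes should be (re-)indexed."""
    last = indexed_size.get(message_id)
    if last is None or file_size > last:
        # New message, or a larger copy that may carry extra content.
        indexed_size[message_id] = file_size
        return True
    # Same or smaller size: treat it as a rename or duplicate and skip.
    return False
```

A renamed file has the same size as before, so it is skipped, while a
larger copy of the same message gets indexed.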

That would even take care of providing the opportunity to index
additional mailing-list-added content for messages also sent directly
via CC.

The file-size heuristic wouldn't be perfect for these other cases. I
guess we could save a list of SHA-1 sums for indexed files or so,
assuming that's cheaper than just re-indexing. Before the Xapian Defect
250 fix I'm sure it is, but after I'm not sure. Maybe we should just
always re-index, though I think I have seen the TermGenerator appear in
profiles of indexing runs.
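
The SHA-1 variant could be sketched like this in Python, with a dict of
sets standing in for whatever the database would store. The names
indexed_digests and needs_indexing are assumptions for illustration:

```python
import hashlib

# Hypothetical stand-in: message ID -> set of SHA-1 hex digests of the
# file contents already indexed for that message.
indexed_digests = {}


def needs_indexing(message_id, content):
    """Return True if these bytes have not been indexed for message_id."""
    digest = hashlib.sha1(content).hexdigest()
    seen = indexed_digests.setdefault(message_id, set())
    if digest in seen:
        # Identical content, e.g. a renamed file: nothing new to index.
        return False
    seen.add(digest)
    return True
```

Unlike the size check, this also catches a copy that differs without
being larger, at the cost of hashing every file.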

>   * When we get a message-id conflict check for dummy:True
>     and replace the document if it is there.
> How does this sound?

That sounds fine. It's the same as what I propose above with
"filesize:0" instead of "dummy:true".

> There could be an issue with synthesising too many threads
> and then ending up having to try and put a message in two
> threads? I see there is code for merging threads, would that
> handle this?

It should, yes.

The current logic is that a message can only appear in a single
thread. So if a message has children or parents with distinct thread IDs
then those threads are merged.
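
To illustrate the merging rule (a simplified Python sketch, not the
actual notmuch code; thread_of and add_message are invented names):

```python
import itertools

# Hypothetical store: message ID -> thread ID.
thread_of = {}
_next_thread = itertools.count(1)


def add_message(message_id, related_ids):
    """Place message_id in a thread, merging the threads of its relatives.

    related_ids are the parents and children of the message. Relatives
    that have not been seen yet are ignored in this sketch.
    """
    threads = sorted({thread_of[r] for r in related_ids if r in thread_of})
    winner = threads[0] if threads else next(_next_thread)
    losers = set(threads[1:])
    # A message can only be in one thread, so fold every message from
    # the other threads into the winning one.
    for mid, tid in list(thread_of.items()):
        if tid in losers:
            thread_of[mid] = winner
    thread_of[message_id] = winner
    return winner
```

Adding a message whose relatives sit in two different threads leaves all
of them in a single thread afterwards.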

I can imagine some strange cross-posting scenario where one could argue
that the merging shouldn't happen, but I'm not sure we want to try to
respect that.


