On Fri, 18 Dec 2009 19:02:21 +0000, James Westby <jw+debian at jameswestby.net> wrote:
> I like the architecture of notmuch, and have just switched
> to using it as my primary client, so thanks.

You're quite welcome, James. Welcome to notmuch!

> Therefore I'd like to fix this. The obvious way is to
> introduce documents in to the db for each id we see, and
> threading should then naturally work better.

That sounds like a fine idea.

> The only issue I see with doing this is with mail delays.
> Once we do this we will sometimes receive a message that
> already has a dummy document. What happens currently with
> message-id collisions?

The current message-ID collision logic is pretty brain-dead. It just
says "Oh, I've seen a file with this message before, so I'll skip this
additional file". But I'm just putting the finishing touches on a
patch that instead does:

    Oh, and here's an additional filename for that message ID. Add
    that too, please.

Beyond that, all we would need to do is to also index the new
content. I don't want to do useless re-indexing when files just get
renamed. So maybe all we need to do is to save the filesize of the
last-indexed file for a document, and then when we encounter a file
with the same message ID and a larger file size, index it as well?

That would even take care of providing the opportunity to index
additional mailing-list-added content for messages also sent directly
via CC.

The file-size heuristic wouldn't be perfect for these other cases. I
guess we could save a list of sha-1 sums for indexed files or so,
(assuming that's cheaper than just re-indexing. Before the Xapian
Defect 250 fix I'm sure it is, but after I'm not sure. Maybe we should
just always re-index, but I think I have seen the TermGenerator appear
in profiles of indexing runs.)

>   * When we get a message-id conflict check for dummy:True
>     and replace the document if it is there.
>
> How does this sound?

That sounds fine. It's the same as what I propose above with
"filesize:0" instead of "dummy:true".

> There could be an issue with synthesising too many threads
> and then ending up having to try and put a message in two
> threads?
> I see there is code for merging threads, would that
> handle this?

It should, yes. The current logic is that a message can only appear in
a single thread. So if a message has children or parents with distinct
thread IDs then those threads are merged.

I can imagine some strange cross-posting scenario where one could
argue that the merging shouldn't happen, but I'm not sure we want to
try to respect that.

-Carl
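
To make the proposal above concrete, here is a minimal sketch of the
bookkeeping it implies. It is not notmuch code: the names ToyIndex,
Doc, add_file and index_runs are hypothetical, a plain dict stands in
for the Xapian database, and keeping the lowest thread ID when merging
is an arbitrary choice made only for the example. A document with
indexed_size 0 plays the role of the dummy document.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Doc:
    thread_id: int
    filenames: list = field(default_factory=list)
    indexed_size: int = 0  # 0 marks a dummy document (only seen as a reference)
    index_runs: int = 0    # stand-in counter for actual indexing work


class ToyIndex:
    """Hypothetical in-memory stand-in for the mail database."""

    def __init__(self):
        self.docs = {}  # message ID -> Doc
        self._next_thread = count(1)

    def add_file(self, message_id, filename, size, references=()):
        # Thread IDs already assigned to this message or its references.
        known = {self.docs[m].thread_id
                 for m in (message_id, *references) if m in self.docs}
        thread_id = min(known) if known else next(self._next_thread)

        # A message lives in exactly one thread, so distinct threads
        # reachable from this message are merged into one.
        for doc in self.docs.values():
            if doc.thread_id in known:
                doc.thread_id = thread_id

        # Unseen references get a dummy document (indexed_size == 0).
        for ref in references:
            self.docs.setdefault(ref, Doc(thread_id))

        doc = self.docs.setdefault(message_id, Doc(thread_id))

        # Collision handling: always record the additional filename...
        if filename not in doc.filenames:
            doc.filenames.append(filename)
        # ...but only (re-)index when the file is larger than the last
        # one indexed. A dummy has size 0, so any real file replaces it.
        if size > doc.indexed_size:
            doc.indexed_size = size
            doc.index_runs += 1
        return doc
```

With this sketch, a renamed file of the same size only adds a
filename, a larger copy (say, one with a mailing-list footer) triggers
a second indexing pass, and a late-arriving parent simply fills in its
dummy document.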