Deduplication ?

Vladimir Marek Mon, 2 Jun 2014 15:54:47 +0200

> > I want to import a bigger chunk of archived messages into my notmuch
> > database, about 100k messages. The problem is that I most probably
> > already have quite a lot of those messages in the DB. Basically I would
> > like to add only those I don't have already.
> >
> > There are two possibilities:
> >
> > a) I will add all the 100k messages and then remove the duplicates.
> >
> > b) I will write a script which will parse the message IDs of the
> >    to-be-added messages and try to match them against the notmuch DB,
> >    adding only the files I can't find there already.
> >
> > b) might be the better option, but I started to play with the idea of
> > deduplication. I'm thinking about listing all the message IDs stored in
> > DB, listing all files belonging to the IDs and deleting all but one.
> > Also I'm thinking about implementing some simple algorithm telling me
> > whether the messages are really very similar. Just to be sure I don't
> > delete something I don't want to.
> >
> > Has anyone played with this idea?
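
A rough sketch of the deduplication idea, for illustration only. It assumes
the (message ID, file) pairs and the message texts have already been
collected somehow; the function names, the choice of difflib as the
"simple algorithm" and the 0.9 threshold are all my own assumptions, not
anything notmuch provides.

```python
import difflib
from collections import defaultdict


def group_by_id(pairs):
    """Group (message_id, path) pairs; keep only IDs with more than one file."""
    groups = defaultdict(list)
    for message_id, path in pairs:
        groups[message_id].append(path)
    return {mid: paths for mid, paths in groups.items() if len(paths) > 1}


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the two texts are identical."""
    # autojunk=False avoids the popularity heuristic on long texts.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def redundant_copies(texts_by_path, threshold=0.9):
    """Given {path: text} for ONE message ID, keep the first path (sorted)
    and return the other paths that look similar enough to be deleted.

    The threshold is an arbitrary assumption; tune it before deleting anything.
    """
    paths = sorted(texts_by_path)
    if not paths:
        return []
    keep = texts_by_path[paths[0]]
    return [p for p in paths[1:]
            if similarity(keep, texts_by_path[p]) >= threshold]
```

Files sharing an ID but falling below the threshold are left alone, so they
can be inspected by hand. Note that the comparison is quadratic in message
size, so it may be slow on large mails.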
> 
> notsync[1] used the (lack of) existence of a message id in the store to
> decide whether to add something from an IMAP server, but it is old,
> crufty, unused and unloved code.


I see, that's close to my b) solution, thanks!
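
For the record, a minimal sketch of what such a b) script might look like.
It uses only the Python standard library; the function names are made up
for this example, and it assumes the known IDs arrive as lines of the form
"id:<message-id>", which as far as I know is what
"notmuch search --output=messages '*'" prints.

```python
from email.parser import BytesHeaderParser


def message_id_of(path):
    """Return the Message-ID of the mail file at `path` without the
    surrounding angle brackets, or None if the header is missing."""
    with open(path, "rb") as fh:
        headers = BytesHeaderParser().parse(fh)  # reads headers only
    raw = headers.get("Message-ID")
    if raw is None:
        return None
    return str(raw).strip().strip("<>")


def parse_known_ids(lines):
    """Turn lines such as 'id:foo@example.org' into a set of bare IDs.
    Lines without the 'id:' prefix are ignored (assumed input format)."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line.startswith("id:"):
            ids.add(line[3:])
    return ids


def files_to_import(paths, known_ids):
    """Return the paths whose Message-ID is not in `known_ids`.

    Duplicates inside `paths` itself are skipped as well. Files with no
    Message-ID are kept, since they cannot be matched this way.
    """
    seen = set(known_ids)
    wanted = []
    for path in paths:
        mid = message_id_of(path)
        if mid is None:
            wanted.append(path)
        elif mid not in seen:
            seen.add(mid)
            wanted.append(path)
    return wanted
```

The result of files_to_import() would then be the list of files to copy
into the mail store before running "notmuch new".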
-- 
        Vlad
