On 13Nov2013 09:06, Chris Down <[email protected]> wrote:
> On 2013-11-12 19:22:24 +0100, Jonas Petong wrote:
> > Today I accidentally copied my mails into the same folder where they had
> > been stored before (evil keybinding!!!) and now I'm faced with about a
> > 1000 copies within my inbox. Since those duplicates do not have a unique
> > mail-id, it's hopeless to filter them with mutt's integrated duplicate
> > limiting pattern. Command '<limit>~=' has no effect in my case and
> > deleting them by hand will take me hours!
> >
> > I know this question has been (unsuccessfully) asked before. Anyhow, is
> > there a way to tag every other mail (literally every nth mail of my
> > inbox-folder) and afterwards delete them? I know something about Linux
> > scripting but unfortunately I have no clue where to start, or even which
> > scripting language to use.
>
> for every file:
>     read the file and put the message-id in a dict, in
>     { message-id: [file1, file2, ..., fileN] } form
>
> for each key in that dict:
>     delete all filename values except the first
>
> It should not be very complicated to write. If nobody else comes up with
> something, I can possibly write it for you after work.
Based on Jonas' post:

    Since those duplicates do not have a unique mail-id, it's hopeless
    to filter them with mutt's integrated duplicate limiting pattern.
    Command '<limit>~=' has no effect

I'd infer that the Message-ID fields are in fact distinct between the
duplicates.
Jonas:
_Why_/_how_ did you get duplicate messages with distinct message-ids?
Have you verified (by inspecting a pair of duplicate messages) that
their Message-ID headers are different?
If the message-ids are unique for the duplicate messages I would:
Move all the messages to a Maildir folder if they are not already so.
This lets you deal with each message as a distinct file.
Write a script along the lines of Chris Down's suggestion, but collate
messages by subject line, and for each message store a tuple of:

    (message-file-path, Date:-header-value, Message-ID:-header-value)

You may then want to compare messages with identical Date: values.
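The collation step above might be sketched like this, again assuming Maildir files; the helper name is illustrative:

```python
# Sketch: group Maildir messages by Subject:, storing
# (file-path, Date:, Message-ID:) tuples for later comparison.
import os
from collections import defaultdict
from email.parser import BytesParser

def collate_by_subject(maildir):
    groups = defaultdict(list)
    for sub in ("new", "cur"):
        d = os.path.join(maildir, sub)
        for name in os.listdir(d):
            path = os.path.join(d, name)
            with open(path, "rb") as f:
                msg = BytesParser().parse(f, headersonly=True)
            groups[msg.get("Subject")].append(
                (path, msg.get("Date"), msg.get("Message-ID")))
    return groups
```

Each group with more than one entry is a candidate duplicate set; comparing the Date: strings within a group tells you which are the true copies.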
Or, if you are truly sure that the folder contains an exact and complete
duplicate: load all the filenames, order them by Date: header, iterate
over the sorted list, and _move_ every second item into another Maildir
folder (in case you're wrong).
    L = []
    for each Maildir file in new/ and cur/:
        load the message headers and get the Date: and Subject: strings
        L.append( (Date:-value, Subject:-value, maildir-file-path) )
    L = sorted(L)
    for i in range(0, len(L), 2):
        move the file L[i][2] into another directory
Note that you don't need to _parse_ the Date: header; if these are
duplicated messages, the literal text of the Date: header should be
identical for the adjacent messages. HOWEVER, you probably want to
first check that every identical date/subject grouping is exactly a
pair, to guard against multiple distinct messages that happen to have
identical dates.
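A runnable rendering of that loop, under the same "exact and complete duplicate" assumption, might look like this; the quarantine-directory name and function name are illustrative, and moving (rather than deleting) keeps recovery easy:

```python
# Sketch: sort Maildir messages by (Date:, Subject:) and move every
# second one into a quarantine directory. Only safe if the folder
# really is one exact, complete duplicate of itself.
import os
import shutil
from email.parser import BytesParser

def move_every_second(maildir, quarantine):
    os.makedirs(quarantine, exist_ok=True)
    L = []
    for sub in ("new", "cur"):
        d = os.path.join(maildir, sub)
        for name in os.listdir(d):
            path = os.path.join(d, name)
            with open(path, "rb") as f:
                msg = BytesParser().parse(f, headersonly=True)
            # Use the literal header text; no Date: parsing needed.
            L.append((msg.get("Date", ""), msg.get("Subject", ""), path))
    L.sort()
    for i in range(0, len(L), 2):      # duplicates sort adjacent; take one of each pair
        shutil.move(L[i][2], quarantine)
```

If anything looks wrong afterwards, the quarantined files can simply be moved back.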
Cheers,
--
Cameron Simpson <[email protected]>
If you can't annoy somebody, there's little point in writing.
- Kingsley Amis