Re: Yet another 'duplicate' thread

2013-11-15 Thread Jonas Petong
On 15.Nov 2013, 01:18, Gregor Zattler wrote:
 Hi Jonas,
 * Jonas Petong jonas.pet...@web.de [13. Nov. 2013]:
  On 13.Nov 2013, 13:01, Nathan Stratton Treadway wrote:
  On Wed, Nov 13, 2013 at 18:50:44 +0100, Jonas Petong wrote:
  Cameron, you were right, the message id's are the same. From the matter of fact
  that limiting my Inbox by ~= did not work led me to the conclusion that their
  IDs have been different. Seems like you've teached me wrong so.
  
  What happens when you try to limit by ~= ?
  
  (Note that as I understand this limit only works when the sort order is
   thread.  That is, with no limit applied you should be seeing the
   duplicate messages marked with an = character in your mailbox index
   listing, and then those marked messages will be selected by the ~=
   filter.)
  
  solved... 
 
 but then, why limit the view?  Delete the duplicate message
 right away:
 
 1) open the mail folder in question
 2) switch to threaded view (per default the key binding is ot)
 3) delete-pattern ~= (per default the key binding is D~=RETURN)
 4) carefully examine if the right messages are flagged with a D
 5) expunge the messages via sync-mailbox (default key binding
    in index is $).

In fact this is how I did it in the end. Thank you anyway, though. It seems
I didn't point this out in detail when writing my first request.

Have a nice weekend!
Jonas

 
 Done.
 
 HTH, Gregor

-- 
the basis of a healthy, tidy mind is a big trash basket. [Kurt Tucholsky]


Re: Yet another 'duplicate' thread

2013-11-14 Thread Jonas Petong
On 14.Nov 2013, 10:24, Cameron Simpson wrote:
 On 13.Nov 2013, 13:01, Nathan Stratton Treadway wrote:
  (Note that as I understand this limit only works when the sort order is
   thread. That is, with no limit applied you should be seeing the
   duplicate messages marked with an = character in your mailbox index
   listing, and then those marked messages will be selected by the ~=
   filter.)
 
 Worth restating. This is something of a mutt annoyance - silent failure.
 
 On 13Nov2013 20:38, Jonas Petong jonas.pet...@web.de wrote:
  Sorry for that one! Cameron, could you explain me anyhow how to use that script
  you proposed? Or at least which environment to set? Might be of use for further
  stuck in nowhere problems (even if for no reason as in my case). You all have
  a great day!
 
 Well, the script as supplied is pseudocode (and of course untested),
 but based around using Python. (If you don't know Python, it is
 well worth learning.)

In fact I was going to learn Python anyway, for the simple reason that it is the
preferred scripting language for managing a Raspberry Pi! I'll take your advice
for sure, then.

 
 A fuller (but still totally untested) sketch might look like this:
 
 #!/usr/bin/python
 
 import sys
 import email.parser
 from mailbox import Maildir
 
 # get the maildir pathname from the command line
 mdirpath = sys.argv[1]
 
 # open the Maildir
 M = Maildir(mdirpath)
 
 # list holding message information
 L = []
 for key in M.keys():
     # open the message file
     fp = M.get_file(key)
     # load the headers from this message
     hdrs = email.parser.Parser().parse(fp, headersonly=True)
     # speculative: get the filename of the message
     # (the object get_file() returns may not have a .name attribute)
     pathname = fp.name
     fp.close()
     # make a tuple with the info we want
     info = hdrs['date'], hdrs['subject'], hdrs['message-id'], key, pathname
     L.append(info)
 
 # sort the list
 # because we have date then subject in the tuple, the sort order is
 # date then subject (then message-id, then key)
 L = sorted(L)
 
 # this last bit could be adapted to move every second message elsewhere
 for i in range(0, len(L), 2):
     date, subject, message_id, key, pathname = L[i]
     # ... decide what to do ...
 
 The last loop iterates 0, 2, 4,... up to the largest index in the list L.
 
 Pulling every second message like this is very fragile - you need
 to be totally sure that you have an exactly duplicated set of messages.
 
 Personally, I would be inclined to make a dict instead of a list,
 mapping message-ids to a list of message paths (or the info tuples).
 Then you can iterate over the dict and remove or move sideways the
 second and following messages for each message-id, leaving only the
 original.
 
 I'd also be writing this script to print a report instead of
 moving/deleting. Then I can examine the output for sanity before
 hitting the button. If the report went:
 
 pathname message-id date subject
 
 it would be easy to read the pathnames from a second script to do
 the actual message removal. Or whatever.
 
 Please feel free to ask whatever questions you like. I do a lot of
 stuff with Maildirs and Python; I replaced procmail with my own
 mail filing program a year or so ago.

The only thing left for me to do is to follow Maurice's good example and
express my thanks for this detailed answer. Thank you so much for your
effort! The way you explained those two lines of code makes them easy to
understand and, in fact, is a perfect start for learning Python. Even if
that wasn't my intention in the first place ;-) Thank you, Cameron!

cheers,
jonas

 
 Cheers,
 -- 
 Cameron Simpson c...@zip.com.au
 
 Q: How many user support people does it take to change a light bulb?
 A: We have an exact copy of the light bulb here and it seems to be
 working fine. Can you tell me what kind of system you have?

-- 
the basis of a healthy, tidy mind is a big trash basket. [Kurt Tucholsky]


Re: Yet another 'duplicate' thread

2013-11-14 Thread Gregor Zattler
Hi Jonas,
* Jonas Petong jonas.pet...@web.de [13. Nov. 2013]:
 On 13.Nov 2013, 13:01, Nathan Stratton Treadway wrote:
 On Wed, Nov 13, 2013 at 18:50:44 +0100, Jonas Petong wrote:
 Cameron, you were right, the message id's are the same. From the matter of fact
 that limiting my Inbox by ~= did not work led me to the conclusion that their
 IDs have been different. Seems like you've teached me wrong so.
 
 What happens when you try to limit by ~= ?
 
 (Note that as I understand this limit only works when the sort order is
  thread.  That is, with no limit applied you should be seeing the
  duplicate messages marked with an = character in your mailbox index
  listing, and then those marked messages will be selected by the ~=
  filter.)
 
 solved... 

but then, why limit the view?  Delete the duplicate message
right away:

1) open the mail folder in question
2) switch to threaded view (per default the key binding is ot)
3) delete-pattern ~= (per default the key binding is D~=RETURN)
4) carefully examine if the right messages are flagged with a D
5) expunge the messages via sync-mailbox (default key binding
   in index is $).

Done.

HTH, Gregor


Re: Yet another 'duplicate' thread

2013-11-13 Thread Jonas Petong
On 13.Nov 2013, 00:48, Ken Moffat wrote:
 On Tue, Nov 12, 2013 at 07:22:24PM +0100, Jonas Petong wrote:
  Today I accidentally copied my mails into the same folder where they had been
  stored before (evil keybinding!!!) and now I'm faced with about 1000 copies
  within my inbox. Since those duplicates do not have a unique mail-id, it's
  hopeless to filter them with mutt's integrated duplicate limiting pattern.
  Command 'limit~=' has no effect in my case and deleting them by hand
  will take me hours!
  
  I know this question has been (unsuccessfully) asked before. Anyhow, is there
  a way to tag every other mail (literally every nth mail of my inbox-folder) and
  afterwards delete them? I know something about linux-scripting but unfortunately
  I have no clue where to start or even which scripting language to use.
  
  This close-to-topic approach with 'fdupes' was published some time ago
  (http://consolematt.wordpress.com/tag/fdupes/) but in my view it seems way too
  complicated. As far as I can tell from mutt's mailing list archive, I'm not the
  only one who has had trouble with it. Therefore I appreciate any hint which
  points me in the right direction and helps me solve this.
  
  Running Mutt 1.5.21 under Ubuntu Gnome 13.10. (Linux 3.11.0-13-generic).
  
  I don't have a script, but I usually view lists without threading,
 using date/time sent in sender's timezone (%d) - I'm sure that using
 the local time zone (%D) probably works the same way.  On occasion I've
 had to change which of my upstreams was subscribed to heavy-traffic
 lists such as lkml, and at other times I've occasionally had mails
 appearing twice after upstream problems.  When needed, it's just a
 case of looking at the index and deleting every other mail.
 Tedious, but achievable - particularly for only 1000 mails - I've
 done more than that in the past ;-)

me too, but I thought that was kind of a waste of time if there was a
possibility to solve this with a script automatically. Or even better within
mutt itself. By the way I'm a bit worried about my 'j' key ;-)

 
  I believe the order in which I see mails is governed by
 index_format [ I haven't looked at this stuff in ages - why break
 what works for me ]. Mine is:
 
 set index_format="%4C %Z %{%b %d} %-15.15n (%?l?%4l%4c?) %s"

looks pretty much like mine.

  If you aren't a reckless person, turn off incoming mail and backup
 the directory or mbox before you try *any* solution.

thank you for that one, I mean it! Wouldn't be the first time trying to restore
old folders from my external backup drive. Just stored a copy of my ~/Mails :-)

 
 ĸen
 -- 
 das eine Mal als Tragödie, dieses Mal als Farce

-- 
the basis of a healthy, tidy mind is a big trash basket. [Kurt Tucholsky]


Re: Yet another 'duplicate' thread

2013-11-13 Thread Jonas Petong
On 13.Nov 2013, 13:01, Nathan Stratton Treadway wrote:
 On Wed, Nov 13, 2013 at 18:50:44 +0100, Jonas Petong wrote:
  Cameron, you were right, the message id's are the same. From the matter of fact
  that limiting my Inbox by ~= did not work led me to the conclusion that their
  IDs have been different. Seems like you've teached me wrong so.
 
 What happens when you try to limit by ~= ?
 
 (Note that as I understand this limit only works when the sort order is
 thread.  That is, with no limit applied you should be seeing the
 duplicate messages marked with an = character in your mailbox index
 listing, and then those marked messages will be selected by the ~=
 filter.)

solved... this is really a newbie's error: not reading the manual properly -.-
Sorry for that one!  Cameron, could you explain to me anyhow how to use that
script you proposed? Or at least which environment to set up? Might be of use
for future stuck-in-nowhere problems (even if for no reason as in my case).
Have a great day, all of you!

 
   Nathan

-- 
the basis of a healthy, tidy mind is a big trash basket. [Kurt Tucholsky]


Re: Yet another 'duplicate' thread

2013-11-13 Thread Maurice McCarthy
Please excuse a numpty interrupting, but could an old procmail recipe
be adapted for use here? What I've got I don't understand, and it was
poached from somewhere or other:

# Get rid of duplicates
:0 Whc: .msgid.lock
| formail -D 16384 .msgid.cache
:0 a
/dev/null


Regards
Maurice


Re: Yet another 'duplicate' thread

2013-11-13 Thread Cameron Simpson
On 13Nov2013 20:20, Maurice McCarthy mansel...@gmail.com wrote:
 Please excuse a numpty interrupting, but could an old procmail recipe
 be adapted for use here. What I've got I don't understand and it was
 poached from somewhere or other
 
 # Get rid of duplicates
 :0 Whc: .msgid.lock
 | formail -D 16384 .msgid.cache
 :0 a
 /dev/null

I prefer to do this in mutt using the ~= search (matches messages
that are dupes of other messages). It is more visible. FWIW, I
used to use the above procmail recipe, before deciding to do it in
mutt.

The above recipe uses formail to consult a tiny database where it
keeps recently seen message-ids (16384 is the approximate size of the
cache file, not a count of ids). If the current message's message-id
is already there it exits successfully. This is the condition for the
actual filing target /dev/null. So: if already seen, file message to
/dev/null (discard it).

From man formail:

   −D maxlen idcache
  Formail  will  detect if the Message‐ID of the current message
  has already been seen using an idcache file  of  approximately
  maxlen size.  If not splitting, it will return success if a
  duplicate has been found.  If splitting, it will not output
  duplicate  messages.  If  used in  conjunction  with  −r,
  formail will look at the mail address of the envelope sender
  instead at the Message‐ID.

I think it also adds the new message-id if unseen.
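
To make the mechanism concrete, here is a rough Python sketch of the idea: a size-limited cache that reports whether a message-id has been seen and records it if not. The IdCache class and its byte accounting are made up for illustration; this does not reproduce formail's actual cache file format, and nothing is persisted to disk.

```python
from collections import deque


class IdCache:
    """Remember recently seen Message-IDs within a rough byte budget."""

    def __init__(self, maxlen=16384):
        self.maxlen = maxlen  # approximate budget in bytes (assumption)
        self.ids = deque()    # oldest id on the left
        self.size = 0

    def seen_before(self, message_id):
        """Return True for a duplicate; otherwise record the id, return False."""
        if message_id in self.ids:
            return True
        self.ids.append(message_id)
        self.size += len(message_id) + 1
        # drop the oldest ids once the budget is exceeded
        while self.size > self.maxlen and len(self.ids) > 1:
            self.size -= len(self.ids.popleft()) + 1
        return False
```

Because old ids fall out of the cache, a duplicate that arrives long after the original would not be detected.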

I do this in mutt for a few reasons:

  - this recipe prevents one from refiling a message.
    Scenario: change filing rules, submit misfiled message to the new rules.
    Result: message thrown away.

  - using mutt makes the discard visible.
    (except that I have an unconditional folder-hook to discard ~= messages
    on entry anyway now)
    At least it is per folder and does not prevent me refiling.

  - I no longer use procmail to file my mail, preferring a tool of
    my own called mailfiler.

Cheers,
--
Cameron Simpson c...@zip.com.au

Since I've mentioned the subject of geneology, I'll repeat a story I heard
about a poor fellow over on airstrip one.  Seems he spent the most recent
thirty years of his life tracking down his family history.  Spent hundreds
of pounds, traveled, devoted his life to it.  Then, last month, a cousin
told him he was adopted.  Ahhh, sweet irony.
- Tim_Mefford t...@physics.orst.edu


Re: Yet another 'duplicate' thread

2013-11-13 Thread Cameron Simpson
On 13.Nov 2013, 13:01, Nathan Stratton Treadway wrote:
 (Note that as I understand this limit only works when the sort order is
  thread.  That is, with no limit applied you should be seeing the
  duplicate messages marked with an = character in your mailbox index
  listing, and then those marked messages will be selected by the ~=
  filter.)

Worth restating. This is something of a mutt annoyance - silent failure.

On 13Nov2013 20:38, Jonas Petong jonas.pet...@web.de wrote:
 Sorry for that one!  Cameron, could you explain me anyhow how to use that script
 you proposed? Or at least which environment to set? Might be of use for further
 stuck in nowhere problems (even if for no reason as in my case). You all have
 a great day!

Well, the script as supplied is pseudocode (and of course untested),
but based around using Python. (If you don't know Python, it is
well worth learning.)

A fuller (but still totally untested) sketch might look like this:

  #!/usr/bin/python

  import sys
  import email.parser
  from mailbox import Maildir

  # get the maildir pathname from the command line
  mdirpath = sys.argv[1]

  # open the Maildir
  M = Maildir(mdirpath)

  # list holding message information
  L = []
  for key in M.keys():
      # open the message file
      fp = M.get_file(key)
      # load the headers from this message
      hdrs = email.parser.Parser().parse(fp, headersonly=True)
      # speculative: get the filename of the message
      # (the object get_file() returns may not have a .name attribute)
      pathname = fp.name
      fp.close()
      # make a tuple with the info we want
      info = hdrs['date'], hdrs['subject'], hdrs['message-id'], key, pathname
      L.append(info)

  # sort the list
  # because we have date then subject in the tuple, the sort order is
  # date then subject (then message-id, then key)
  L = sorted(L)

  # this last bit could be adapted to move every second message elsewhere
  for i in range(0, len(L), 2):
      date, subject, message_id, key, pathname = L[i]
      # ... decide what to do ...

The last loop iterates 0, 2, 4, ... up to the largest index in the list L.

Pulling every second message like this is very fragile - you need
to be totally sure that you have an exactly duplicated set of messages.

Personally, I would be inclined to make a dict instead of a list,
mapping message-ids to a list of message paths (or the info tuples).
Then you can iterate over the dict and remove or move sideways the
second and following messages for each message-id, leaving only the
original.

I'd also be writing this script to print a report instead of
moving/deleting. Then I can examine the output for sanity before
hitting the button. If the report went:

  pathname message-id date subject

it would be easy to read the pathnames from a second script to do
the actual message removal. Or whatever.
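
A minimal sketch of such a second script, assuming the report is tab-separated with the pathname in the first column. The move_reported name, the tab separator and the holding directory are assumptions for illustration; files are moved aside rather than deleted, so a mistake can be undone.

```python
import os
import shutil


def move_reported(report_lines, holding_dir):
    """Move each pathname named in the report into holding_dir.

    Assumes one message per line, tab-separated, pathname first:
    pathname<TAB>message-id<TAB>date<TAB>subject
    """
    os.makedirs(holding_dir, exist_ok=True)
    moved = []
    for line in report_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        pathname = line.split("\t", 1)[0]
        if os.path.isfile(pathname):
            target = os.path.join(holding_dir, os.path.basename(pathname))
            shutil.move(pathname, target)
            moved.append(target)
    return moved
```

It could be fed the saved report, for example move_reported(open("report.txt"), "/tmp/dupes"), where both names are placeholders.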

Please feel free to ask whatever questions you like. I do a lot of
stuff with Maildirs and Python; I replaced procmail with my own
mail filing program a year or so ago.

Cheers,
-- 
Cameron Simpson c...@zip.com.au

Q: How many user support people does it take to change a light bulb?
A: We have an exact copy of the light bulb here and it seems to be
   working fine.  Can you tell me what kind of system you have?


Re: Yet another 'duplicate' thread

2013-11-13 Thread Maurice McCarthy
Cameron

Many thanks indeed for taking the time to write out a detailed explanation!

Best Regards
Maurice

On 13/11/2013, Cameron Simpson c...@zip.com.au wrote:
 On 13Nov2013 20:20, Maurice McCarthy mansel...@gmail.com wrote:
 Please excuse a numpty interrupting, but could an old procmail recipe
 be adapted for use here. What I've got I don't understand and it was
 poached from somewhere or other

 # Get rid of duplicates
 :0 Whc: .msgid.lock
 | formail -D 16384 .msgid.cache
 :0 a
 /dev/null

 I prefer to do this in mutt using the ~= search (matches messages
 that are dupes of other messages). It is more visible. FWIW, I
 used to use the above procmail recipe, before deciding to do it in
 mutt.


Yet another 'duplicate' thread

2013-11-12 Thread Jonas Petong
Today I accidentally copied my mails into the same folder where they had been
stored before (evil keybinding!!!) and now I'm faced with about 1000 copies
within my inbox. Since those duplicates do not have a unique mail-id, it's
hopeless to filter them with mutt's integrated duplicate limiting pattern.
Command 'limit~=' has no effect in my case and deleting them by hand
will take me hours!

I know this question has been (unsuccessfully) asked before. Anyhow, is there
a way to tag every other mail (literally every nth mail of my inbox-folder) and
afterwards delete them? I know something about linux-scripting but unfortunately
I have no clue where to start or even which scripting language to use.

This close-to-topic approach with 'fdupes' was published some time ago
(http://consolematt.wordpress.com/tag/fdupes/) but in my view it seems way too
complicated. As far as I can tell from mutt's mailing list archive, I'm not the
only one who has had trouble with it. Therefore I appreciate any hint which
points me in the right direction and helps me solve this.

Running Mutt 1.5.21 under Ubuntu Gnome 13.10. (Linux 3.11.0-13-generic).

cheers,
jonas



Re: Yet another 'duplicate' thread

2013-11-12 Thread Chris Down
On 2013-11-12 19:22:24 +0100, Jonas Petong wrote:
 Today I accidentally copied my mails into the same folder where they had been
 stored before (evil keybinding!!!) and now I'm faced with about 1000 copies
 within my inbox. Since those duplicates do not have a unique mail-id, it's
 hopeless to filter them with mutt's integrated duplicate limiting pattern.
 Command 'limit~=' has no effect in my case and deleting them by hand
 will take me hours!
 
 I know this question has been (unsuccessfully) asked before. Anyhow, is there
 a way to tag every other mail (literally every nth mail of my inbox-folder) and
 afterwards delete them? I know something about linux-scripting but unfortunately
 I have no clue where to start or even which scripting language to use.

for every file:
    read file and put the message-id in a dict in
    { message-id: [file1, file2..fileN] } order

for each key in that dict:
    delete all filename values except the first

It should not be very complicated to write. If nobody else comes up with
something, I can possibly write it for you after work.
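
The pseudocode above might be fleshed out roughly as follows in Python 3. This is an untested-on-real-mail sketch: the function names are invented for illustration, it assumes a Maildir with cur and new subdirectories, and it only lists the extra copies instead of deleting them.

```python
import os
from collections import defaultdict
from email.parser import BytesParser


def group_by_message_id(maildir_path):
    """Map each Message-ID to the list of files that carry it."""
    groups = defaultdict(list)
    parser = BytesParser()
    for subdir in ("cur", "new"):
        dirpath = os.path.join(maildir_path, subdir)
        if not os.path.isdir(dirpath):
            continue
        for name in sorted(os.listdir(dirpath)):
            path = os.path.join(dirpath, name)
            # only the headers are needed to read the Message-ID
            with open(path, "rb") as fp:
                headers = parser.parse(fp, headersonly=True)
            message_id = headers["message-id"]
            if message_id is not None:
                groups[str(message_id).strip()].append(path)
    return groups


def extra_copies(groups):
    """Return every file except the first one for each Message-ID."""
    extras = []
    for paths in groups.values():
        extras.extend(paths[1:])
    return extras
```

Printing extra_copies(group_by_message_id(path)) first and checking the list by eye seems safer than deleting straight away.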




Re: Yet another 'duplicate' thread

2013-11-12 Thread Cameron Simpson
On 13Nov2013 09:06, Chris Down ch...@chrisdown.name wrote:
 On 2013-11-12 19:22:24 +0100, Jonas Petong wrote:
  Today I accidentally copied my mails into the same folder where they had been
  stored before (evil keybinding!!!) and now I'm faced with about 1000 copies
  within my inbox. Since those duplicates do not have a unique mail-id, it's
  hopeless to filter them with mutt's integrated duplicate limiting pattern.
  Command 'limit~=' has no effect in my case and deleting them by hand
  will take me hours!
  
  I know this question has been (unsuccessfully) asked before. Anyhow, is there
  a way to tag every other mail (literally every nth mail of my inbox-folder) and
  afterwards delete them? I know something about linux-scripting but unfortunately
  I have no clue where to start or even which scripting language to use.
 
 for every file:
     read file and put the message-id in a dict in
     { message-id: [file1, file2..fileN] } order
 
 for each key in that dict:
     delete all filename values except the first
 
 It should not be very complicated to write. If nobody else comes up with
 something, I can possibly write it for you after work.

Based on Jonas' post:

 Since those duplicates do not have a unique mail-id, it's hopeless
 to filter them with mutts integrated duplicate limiting pattern.
 Command 'limit~=' has no effect

I'd infer that the message-id fields are unique.

Jonas:

_Why_/_how_ did you get duplicate messages with distinct message-ids?
Have you verified (by inspecting a pair of duplicate messages) that
their Message-ID headers are different?

If the message-ids are unique for the duplicate messages I would:

  Move all the messages to a Maildir folder if they are not already so.
  This lets you deal with each message as a distinct file.

  Write a script along the lines of Chris Down's suggestion, but collate
  messages by subject line, and store a tuple of:
    (message-file-path, Date:-header-value, Message-ID:-header-value)

You may then want to compare messages with identical Date: values.

Or, if you are truly sure that the folder contains an exact and complete
duplicate: load all the filenames, order by Date:-header, iterate over the
list (after ordering) and _move_ every second item into another Maildir
folder (in case you're wrong).

  L = []
  for each Maildir-file-in-new,cur:
      load in the message headers and get the Date: header string
      L.append( (date:-value, subject:-value, maildir-file-path) )

  L = sorted(L)
  for i in range(0, len(L), 2):
      move the file L[i][2] into another directory

Note that you don't need to _parse_ the Date: header; if these are
duplicated messages the literal text of the Date: header should be
identical for the adjacent messages. HOWEVER, you probably want to
ensure that all the identical date/subject groupings are only pairs,
in case there are multiple distinct messages with identical dates.
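
That pairs check might be sketched like this in Python. The odd_groups name is invented for illustration, and the tuple layout simply follows the pseudocode above (date, subject, path); any group it returns would need a look by hand before moving every second message.

```python
from collections import defaultdict


def odd_groups(infos):
    """Return the (date, subject) groups that do not hold exactly two messages.

    infos is an iterable of (date, subject, path) tuples.
    """
    groups = defaultdict(list)
    for date, subject, path in infos:
        groups[(date, subject)].append(path)
    return {key: paths for key, paths in groups.items() if len(paths) != 2}
```

An empty result means every date/subject combination occurs exactly twice, which is what the every-second-message approach relies on.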

Cheers,
-- 
Cameron Simpson c...@zip.com.au

If you can't annoy somebody, there's little point in writing.
- Kingsley Amis