Questions about importing mail (mbox)

2011-04-16 Thread Pieter Praet
On Mon, 21 Mar 2011 19:02:45 -0700, Mueen Nawaz wrote:
> I think you misunderstood me. A part of me suspects this has something
> to do with my not explaining myself, but who's to say?

Same here, apparently :D

> I'm experimenting with notmuch, and if I can translate everything I
> currently do in mutt to notmuch, then I'll just dump mutt. The set of
> mboxes I have will remain archived, but for all future incoming email,
> I'll switch to MH or MailDir. So I don't actually need to put my old
> mboxes under revision control - I just need to save them somewhere.

I strongly agree that long-term storage choices are a matter of personal
opinion. However, the intention of my proposal was simply to keep track
of what changed in the mbox as a result of the various operations
performed, so as to gain insight into what gets messed up and where.

Non-VCS would be something along the lines of:
compact mbox.orig > mbox.comp   # (*if* "compact" were a valid command)
diff mbox.orig mbox.comp
mb2md -s ./mbox.comp -d ./maildir
cat ./maildir/new/* >> mbox.conv
diff mbox.comp mbox.conv
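
The final comparison can also be scripted. The following is only a minimal sketch, assuming Python's standard `mailbox` module; the names `body_digests` and `compare` are made up for illustration. It hashes message bodies rather than whole files, since a converter may add or drop headers. Note that `mailbox.mbox` has its own "From " line heuristics, so its idea of where a message starts may differ from mutt's.

```python
import hashlib
import mailbox
from collections import Counter


def body_digests(box):
    """Return a Counter of SHA-256 digests, one per message body."""
    digests = Counter()
    for msg in box:
        raw = msg.as_bytes()
        # Hash only what follows the header block, because converters may
        # add or drop headers such as Status or X-Status.
        _, _, body = raw.partition(b"\n\n")
        digests[hashlib.sha256(body.strip()).hexdigest()] += 1
    return digests


def compare(mbox_path, maildir_path):
    """Return (lost, extra): bodies only in the mbox, bodies only in the Maildir."""
    src = mailbox.mbox(mbox_path, create=False)
    dst = mailbox.Maildir(maildir_path, create=False)
    try:
        before = body_digests(src)
        after = body_digests(dst)
    finally:
        src.close()
        dst.close()
    lost = sum((before - after).values())
    extra = sum((after - before).values())
    return lost, extra
```

Calling `compare("mbox.comp", "maildir")` gives two numbers; anything other than `(0, 0)` means the two stores differ and deserves a closer look.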

> > For the actual conversion to Maildir (and any type of mail fetching in
> > general), I'd suggest using FDM [2], you'll never look back.
> 
> Thanks - will take a look.
> 
> > Regarding the significant discrepancy between processed and added files
> > in Notmuch: Could be dupes (e.g. mail to/cc/bcc yourself or mailing
> > lists, ending up in both Inbox and Sent), which are automatically
> > suppressed by Notmuch.
> 
> It definitely was dupes. I didn't realize that notmuch did not keep
> track of dupes. 
> 
> So I wrote a Python script to go through the mboxes and do a count of
> only unique messages. Problem? I have over 1000 emails that don't have a
> Message-ID header (case-insensitive search). I could go over why that is,
> but suffice it to say that I hate Microsoft.
> 
> Once I remove all dupes, I get to within 300-400 of the count that
> notmuch provides. The remaining 1000+ emails do contain some dupes, and
> I can't find a convenient way to get an accurate count of unique emails
> from them, but at least now I'm in the ballpark, and a lot more
> confident.

Sadly, both mb2md and fdm *will* mess things up, since they both split
on every single occurrence of "^From " [1,2], even if it isn't a
separator line.

Both assume occurrences of "^From " in the message body to be already
escaped like so: "^>From " [3,4].
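
To get a feel for how many bogus splits an mbox might produce, one can count lines that merely start with "From " against lines that look like full separator lines. This is a minimal sketch, not what mb2md or fdm do: the "strict" pattern (address followed by an asctime-style date, preceded by a blank line) is an assumption, and `count_separators` is a made-up name.

```python
import re

# Assumed shape of a full separator line, for example:
#   From someone@example.org Mon Mar 21 19:02:45 2011
STRICT_FROM = re.compile(
    r"^From \S+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ \d]\d "
    r"\d\d:\d\d:\d\d \d{4}$"
)


def count_separators(lines):
    """Return (naive, strict) counts of candidate separator lines."""
    naive = strict = 0
    prev_blank = True  # the start of the file counts as "after a blank line"
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("From "):
            naive += 1
            if prev_blank and STRICT_FROM.match(line):
                strict += 1
        prev_blank = line == ""
    return naive, strict
```

If the two numbers differ for a given mbox, some body lines start with an unescaped "From " and a naive splitter will cut those messages in two.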

Even worse, RFC 4155 [5] confirms this to be semi-expected behaviour:
>> Many implementations are also known to escape message body lines that
>> begin with the character sequence of "From ", so as to prevent
>> confusion with overly-liberal parsers that do not search for full
>> separator lines.  In the common case, a leading Greater-Than symbol
>> (0x3E) is used for this purpose (with "From " becoming ">From ").
>> However, other implementations are known not to escape such lines
>> unless they are immediately preceded by a blank line or if they also
>> appear to contain an email address and a timestamp.  Other
>> implementations are also known to perform secondary escapes against
>> these lines if they are already escaped or quoted, while others
>> ignore these mechanisms altogether.

One way to circumvent this is by making use of the Content-Length header
(which is apparently how Mutt does it [6]), but guess what, it suffers
the same fate as Message-ID: it is often simply missing...
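
For illustration, here is a minimal sketch of a splitter that trusts Content-Length only when it lands on a plausible message boundary, and otherwise falls back to the naive "From " scan. It is not taken from Mutt; `split_mbox` is a made-up name and the sanity check is an assumption.

```python
import re

CONTENT_LENGTH = re.compile(rb"^Content-Length:[ \t]*(\d+)[ \t]*$", re.I | re.M)


def split_mbox(data):
    """Split raw mbox bytes into a list of raw messages."""
    messages = []
    pos, n = 0, len(data)
    while pos < n:
        if not data.startswith(b"From ", pos):
            raise ValueError(f"expected a separator line at byte {pos}")
        hdr_end = data.find(b"\n\n", pos)
        if hdr_end == -1:  # headers only, no body
            messages.append(data[pos:])
            break
        body_start = hdr_end + 2
        end = None
        m = CONTENT_LENGTH.search(data[pos:body_start])
        if m:
            candidate = body_start + int(m.group(1))
            nxt = candidate
            while nxt < n and data[nxt:nxt + 1] == b"\n":
                nxt += 1
            # Trust the header only if the body ends at EOF or right
            # before another "From " line.
            if candidate <= n and (nxt >= n or data.startswith(b"From ", nxt)):
                end = nxt
        if end is None:
            # Fallback: split at the next line starting with "From ".
            nxt = data.find(b"\nFrom ", body_start - 1)
            end = n if nxt == -1 else nxt + 1
        messages.append(data[pos:end])
        pos = end
    return messages
```

With a correct Content-Length, a body line such as "From now on ..." stays inside its message; without the header, the same input is cut at that line.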

> Incidentally, one reason I didn't realize dupes were the reason is that
> I did a search for a word in one email I had and notmuch did not find
> it - so I assumed it had not been indexed. Later on, I realized I had
> written a partial word and discovered that notmuch does find it if I
> type the full word.
> 
> What am I doing wrong? Can't notmuch handle partial word matches? Do I
> need to specify an option to get that to work?

AFAIK, this depends on how Xapian splits terms, so it isn't a Notmuch
issue as such. Globbing (a trailing "*") helps (sometimes).

query: "partia AND from:mueen at nawaz.org"
returns nil

query: "partia* AND from:mueen at nawaz.org"
correctly returns this thread.
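
The difference between the two queries can be illustrated with a toy term index. This is only a simplified model of whole-term versus prefix lookup, not Xapian's actual implementation and not a Notmuch API; `build_index` and `search` are made-up names.

```python
import re


def build_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Look up a whole term, or every term sharing a prefix if query ends in '*'."""
    query = query.lower()
    if query.endswith("*"):
        prefix = query[:-1]
        hits = set()
        for term, ids in index.items():
            if term.startswith(prefix):
                hits |= ids
        return hits
    return set(index.get(query, ()))
```

In this model "partia" finds nothing because no document contains that exact term, while "partia*" matches the indexed term "partial".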



Peace

-Pieter


[1] mb2md, line 999 (http://www.linuxkungfu.org/files/scripts/mb2md)
[2] fdm, line 461 (http://fdm.cvs.sourceforge.net/viewvc/fdm/fdm/fetch-mbox.c?view=markup)
[3] mb2md, line 1342 (http://www.linuxkungfu.org/files/scripts/mb2md)
[4] fdm, line 468 (http://fdm.cvs.sourceforge.net/viewvc/fdm/fdm/fetch-mbox.c?view=markup)
[5] RFC 4155, section 2, paragraph 5 (http://tools.ietf.org/html/rfc4155)
[6] http://www.mail-archive.com/mutt-users@mutt.org/msg21921.html


Questions about importing mail (mbox)

2011-03-21 Thread Mueen Nawaz
Jesse Rosenthal writes:

> I didn't need to convert when I started using notmuch, but for past
> mbox-to-maildir conversions, I always had the most confidence in using
> mutt interactively. Tag all messages (S-t, all), copy or save to a
> maildir, and make sure your mbox_type is set appropriately. There are
> scripts out there to automate it, but if you're worried about missing
> something, doing it by hand might work a bit better for you. (You can
> also do it in chunks by date to make sure everything is moving over.)
> Not the most efficient, but you should only have to do it once.

Thanks - will give it a try. It half solves my problem, in that I can do
a message count using mutt before and after to see the conversion went
well. The second issue is figuring out if notmuch really did index all
of them - challenging because I have plenty of dupes. I may just have to
take it all on faith for now.

As I had mentioned, when going from MH to notmuch, it complained
for about 20 messages. I was in a hurry so didn't take a detailed look,
but two of them were clearly corrupt in my mbox file. They had a from
and virtually no other headers. So perhaps all the problems I'm having
stem from corrupt messages in my mbox...



Questions about importing mail (mbox)

2011-03-21 Thread Mueen Nawaz
Pieter Praet writes:
> It would've been a no-brainer if you'd been using Maildir all along
> (mbox is evil incarnate), but...

Sure, but mbox is too convenient.

> I'd suggest keeping your original mbox file safe in git [1], and
> consistently committing every step of the way, so even if messages were
> to get lost in translation, you still have a way to get them back, with
> negligible storage overhead (just remember to "git gc --aggressive
> --prune=now" when you're finished).

I think you misunderstood me. A part of me suspects this has something
to do with my not explaining myself, but who's to say?

I'm experimenting with notmuch, and if I can translate everything I
currently do in mutt to notmuch, then I'll just dump mutt. The set of
mboxes I have will remain archived, but for all future incoming email,
I'll switch to MH or MailDir. So I don't actually need to put my old
mboxes under revision control - I just need to save them somewhere.

> For the actual conversion to Maildir (and any type of mail fetching in
> general), I'd suggest using FDM [2], you'll never look back.

Thanks - will take a look.

> Regarding the significant discrepancy between processed and added files
> in Notmuch: Could be dupes (e.g. mail to/cc/bcc yourself or mailing
> lists, ending up in both Inbox and Sent), which are automatically
> suppressed by Notmuch.

It definitely was dupes. I didn't realize that notmuch did not keep
track of dupes. 

So I wrote a Python script to go through the mboxes and do a count of
only unique messages. Problem? I have over 1000 emails that don't have a
Message-ID header (case-insensitive search). I could go over why that is,
but suffice it to say that I hate Microsoft.

Once I remove all dupes, I get to within 300-400 of the count that
notmuch provides. The remaining 1000+ emails do contain some dupes, and
I can't find a convenient way to get an accurate count of unique emails
from them, but at least now I'm in the ballpark, and a lot more
confident.
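
A counting script along these lines might look like the following. This is a minimal sketch, not the script mentioned above: it assumes Python's standard `mailbox` module, and for messages without a Message-ID it falls back to a digest of the Date, From, To and Subject headers, which is only a heuristic. `count_unique` is a made-up name.

```python
import hashlib
import mailbox


def count_unique(mbox_path):
    """Count messages, unique Message-IDs, and messages lacking a Message-ID."""
    box = mailbox.mbox(mbox_path, create=False)
    ids, fallback = set(), set()
    total = without_id = 0
    try:
        for msg in box:
            total += 1
            # Header lookup in the email package is case-insensitive.
            msg_id = str(msg.get("Message-ID") or "").strip()
            if msg_id:
                ids.add(msg_id)
                continue
            without_id += 1
            # Assumed fallback key for messages that have no Message-ID.
            key = "\n".join(
                str(msg.get(h) or "") for h in ("Date", "From", "To", "Subject")
            )
            fallback.add(hashlib.sha256(key.encode("utf-8", "replace")).hexdigest())
    finally:
        box.close()
    return {
        "total": total,
        "unique_with_id": len(ids),
        "without_id": without_id,
        "unique_without_id": len(fallback),
    }
```

The sum of `unique_with_id` and `unique_without_id` is then a rough estimate to hold against the number of messages notmuch reports as added.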

Incidentally, one reason I didn't realize dupes were the reason is that
I did a search for a word in one email I had and notmuch did not find
it - so I assumed it had not been indexed. Later on, I realized I had
written a partial word and discovered that notmuch does find it if I
type the full word.

What am I doing wrong? Can't notmuch handle partial word matches? Do I
need to specify an option to get that to work?

Anyway, thanks for the help - I'll investigate further.




Questions about importing mail (mbox)

2011-03-21 Thread Pieter Praet
On Sun, 20 Mar 2011 20:30:52 -0700, Mueen Nawaz wrote:
> 
> Hi,
> 
> I'm trying to experiment with notmuch. 
> 
> As I understand it, notmuch does not handle mbox for input. The problem
> is that all my mail is currently in mbox format.
> 
> So I first tried converting mbox to maildir using mb2md.
> 
> It didn't do a good job. When I subsequently tried importing to notmuch,
> notmuch complained about lots of non-mail files - I confirmed that
> indeed mb2md had botched converting those emails.
> 
> So then I tried to convert to mh format using Sylpheed. This seemed to
> go well, but then when importing to notmuch, it complained again for
> about 20 emails, and a manual check confirmed that some messages did not
> get converted properly to mh (they don't show up in Sylpheed).
> 
> And then I noticed another discrepancy. mutt shows that I started with
> 44473 messages in mbox. When I imported into Sylpheed, it showed 44482
> messages (no idea where the extra 9 came from). However, notmuch is
> reporting that it processed 44482 files, but that it added 35602
> messages.
> 
> Why only 35602 (it complained for only about 20 messages)? A search
> confirmed that some messages that show up in both mutt (in mbox) and
> Sylpheed (in mh format) were not indexed.
> 
> So I want to know: When you guys switched to notmuch, how did you ensure
> you did not miss any emails? I really, really, really don't want to lose
> any emails in this process!
> 
> Thanks.
> 
> ___
> notmuch mailing list
> notmuch at notmuchmail.org
> http://notmuchmail.org/mailman/listinfo/notmuch


It would've been a no-brainer if you'd been using Maildir all along
(mbox is evil incarnate), but...

I'd suggest keeping your original mbox file safe in git [1], and
consistently committing every step of the way, so even if messages were
to get lost in translation, you still have a way to get them back, with
negligible storage overhead (just remember to "git gc --aggressive
--prune=now" when you're finished).

Compacting the mbox file, i.e. purging all stale messages (sync-mailbox
in mutt?) and diffing to HEAD could then possibly give you an indication
as to the origin of the 9 surplus files.

For the actual conversion to Maildir (and any type of mail fetching in
general), I'd suggest using FDM [2], you'll never look back.

Regarding the significant discrepancy between processed and added files
in Notmuch: Could be dupes (e.g. mail to/cc/bcc yourself or mailing
lists, ending up in both Inbox and Sent), which are automatically
suppressed by Notmuch.


[1] http://git-scm.com/
[2] http://fdm.sourceforge.net/


Questions about importing mail (mbox)

2011-03-21 Thread Jesse Rosenthal
On Sun, 20 Mar 2011 20:30:52 -0700, Mueen Nawaz wrote:
> 
> So I want to know: When you guys switched to notmuch, how did you ensure
> you did not miss any emails? I really, really, really don't want to lose
> any emails in this process!

I didn't need to convert when I started using notmuch, but for past
mbox-to-maildir conversions, I always had the most confidence in using
mutt interactively. Tag all messages (S-t, all), copy or save to a
maildir, and make sure your mbox_type is set appropriately. There are
scripts out there to automate it, but if you're worried about missing
something, doing it by hand might work a bit better for you. (You can
also do it in chunks by date to make sure everything is moving over.)
Not the most efficient, but you should only have to do it once.

Best,
Jesse


Questions about importing mail (mbox)

2011-03-20 Thread Mueen Nawaz

Hi,

I'm trying to experiment with notmuch. 

As I understand it, notmuch does not handle mbox for input. The problem
is that all my mail is currently in mbox format.

So I first tried converting mbox to maildir using mb2md.

It didn't do a good job. When I subsequently tried importing to notmuch,
notmuch complained about lots of non-mail files - I confirmed that
indeed mb2md had botched converting those emails.

So then I tried to convert to mh format using Sylpheed. This seemed to
go well, but then when importing to notmuch, it complained again for
about 20 emails, and a manual check confirmed that some messages did not
get converted properly to mh (they don't show up in Sylpheed).

And then I noticed another discrepancy. mutt shows that I started with
44473 messages in mbox. When I imported into Sylpheed, it showed 44482
messages (no idea where the extra 9 came from). However, notmuch is
reporting that it processed 44482 files, but that it added 35602
messages.

Why only 35602 (it complained for only about 20 messages)? A search
confirmed that some messages that show up in both mutt (in mbox) and
Sylpheed (in mh format) were not indexed.

So I want to know: When you guys switched to notmuch, how did you ensure
you did not miss any emails? I really, really, really don't want to lose
any emails in this process!

Thanks.