On 4/14/2015 9:25 PM, Jeff Breidenbach wrote:
* I recommend doing the import all at once, rather than in
stages. Not for technical reasons, it just saves manual labor.

OK, I may do it in two stages, since half the archives are in mbox format and can be imported right away. The other half are digests: I need to write the scripts and then run a few thousand emails through them, which could take some weeks. But no more than two stages, if that's OK?

* Happy to make a tarball of the HTML after the import. ...
Either way, you would be totally on your own from there;

Sounds good and more than reasonable.

* Threading is done by MHonArc and discussed here.
http://www.mail-archive.com/faq.html#threading

OK, that was helpful. Basically, References: and In-Reply-To: are used first, and in their absence, Subject: matching is used. That works.

It didn't, however, answer my "statute of limitations" question: when there are no Message-ID: clues and threading relies on Subject: alone, is there a time span beyond which it will not link two probably unrelated threads that happen to share a subject (e.g., someone reuses the same subject a year later)? Not a big deal anyway, just curiosity.
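To make the question concrete, here is a minimal Python sketch of the lookup order described above: References:/In-Reply-To: first, then Subject: matching as a fallback. It is not MHonArc's code. The function names, the dict layout, and in particular the `max_gap` cutoff are hypothetical, added only to illustrate what a "statute of limitations" would look like if one existed.

```python
import re
from datetime import datetime, timedelta

def normalize_subject(subject):
    """Strip leading Re:/Fw:/Fwd: prefixes, collapse whitespace, lowercase."""
    s = subject.strip()
    while True:
        stripped = re.sub(r'^(re|fwd?)\s*:\s*', '', s, flags=re.IGNORECASE)
        if stripped == s:
            break
        s = stripped
    return re.sub(r'\s+', ' ', s).lower()

def find_parent(msg, seen, max_gap=None):
    """Return the id of the likely parent of msg, or None.

    msg and every entry of seen are dicts with the keys 'id', 'refs'
    (ids taken from References:/In-Reply-To:), 'subject' and 'date'.
    seen holds the earlier messages in date order.
    max_gap is a hypothetical cutoff (a timedelta) for Subject: matching.
    """
    known_ids = {m['id'] for m in seen}
    # 1. Explicit references win; the last one listed is the nearest ancestor.
    for ref in reversed(msg['refs']):
        if ref in known_ids:
            return ref
    # 2. Fall back to the most recent earlier message with the same subject.
    wanted = normalize_subject(msg['subject'])
    for m in reversed(seen):
        if normalize_subject(m['subject']) == wanted:
            if max_gap is not None and msg['date'] - m['date'] > max_gap:
                return None  # same subject, but too old to be the same thread
            return m['id']
    return None
```

With `max_gap=None` the sketch links same-subject messages no matter how far apart they are; passing something like `timedelta(days=365)` would keep a subject reused a year later from joining the old thread.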

On 4/14/2015 11:27 AM, Earl Hood wrote:
For MIME digest messages, MUAs like nmh are able to extract such
messages out into individual files, which can be subsequently packed
into mbox format.

I have hundreds of digests, each with perhaps a dozen messages, so it needs to be a script that creates an "mbox of all messages" from an "mbox of digests" in one (or a few) fell swoops. I have a pseudo-code awk program that should do that. I still need to turn it into real awk, but it should work; it's just a question of time to write and debug it, then run all the files through it and verify the results.

I did find a sed script at http://sed.sourceforge.net/grabbag/scripts/splitdig.sed that allegedly does this. It's instructional, but it doesn't seem robust enough: it appears to treat any line consisting entirely of dashes (as few as one!) as a message boundary, whereas such a line could easily occur in the text (preceding a signature, for example, or as a separator). Besides, sed is a "write-only" language (you can program it as you go if you know it well enough, but good luck going back and figuring out what a sed program actually does and how to modify or tweak it!). :-)
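For illustration, a stricter boundary test could look like the Python sketch below. It is neither the awk program mentioned above nor the sed script. It assumes RFC 1153-style plain-text digests, where (as far as I recall the RFC) the preamble ends with a line of 70 hyphens and messages are separated by a line of 30 hyphens, each surrounded by blank lines. The additional requirement that a header-like line or an "End of ..." trailer follow the separator is my own assumption, and `split_digest` is a hypothetical name.

```python
import re

# Assumed separators: exactly 30 or exactly 70 hyphens, nothing else on the line.
SEPARATOR = re.compile(r'^(-{30}|-{70})$')
# Something that looks like a header field, e.g. "From: ..." or "Subject: ...".
HEADER = re.compile(r'^[A-Za-z][A-Za-z0-9-]*:\s')

def split_digest(text):
    """Split one plain-text digest into a list of message strings."""
    lines = text.splitlines()
    cuts = []
    for i, line in enumerate(lines):
        if not SEPARATOR.match(line):
            continue
        blank_before = i == 0 or not lines[i - 1].strip()
        blank_after = i + 2 < len(lines) and not lines[i + 1].strip()
        # Accept the separator only if a message header or the trailer follows.
        if blank_before and blank_after and (
                HEADER.match(lines[i + 2]) or lines[i + 2].startswith('End of ')):
            cuts.append(i)
    messages = []
    for start, end in zip(cuts, cuts[1:] + [len(lines)]):
        chunk = lines[start + 2:end]
        if chunk and chunk[0].startswith('End of '):
            continue  # digest trailer, not a message
        while chunk and not chunk[-1].strip():
            chunk.pop()  # drop trailing blank lines
        messages.append('\n'.join(chunk) + '\n')
    return messages
```

A signature delimiter such as "--" or a short run of dashes inside a body is left alone, because it matches neither the length nor the context test. Each returned string could then be appended to an mbox file, for example with the `add` method of `mailbox.mbox` from Python's standard library.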

One more question: some of the "mbox" files that I propose to submit are in fact Thunderbird mail folders. As far as I can tell, they conform entirely to mbox format (each message starts with a "From " line, and body lines beginning with "From " are escaped as ">From "), but does anyone know of any "gotchas" with this?

Also, my files are all Windows-based, but I assume CR/LF vs. LF is handled automatically and trivially, correct? (Or should I convert them to Unix format myself?)
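If the conversion does have to be done by hand, it is small. A minimal Python sketch follows, with hypothetical function names; it works on bytes so that no character encoding is touched, and it assumes each file fits comfortably in memory.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Replace every CR/LF pair with a bare LF; lone CRs are left untouched."""
    return data.replace(b'\r\n', b'\n')

def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path in binary mode and write an LF-only copy to dst_path."""
    with open(src_path, 'rb') as src:
        data = src.read()
    with open(dst_path, 'wb') as dst:
        dst.write(crlf_to_lf(data))
```

Writing to a separate destination path keeps the original Windows files intact in case the result needs to be compared against them.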

Shahrukh

_______________________________________________
Gossip mailing list
https://www.mail-archive.com/gossip@mail-archive.com
https://www.mail-archive.com/cgi-bin/mailman/options/gossip
