I'm guessing you're hitting unescaped "From" lines in the message bodies, which get taken for the start of a new message in an mbox file. This is my script for processing individual messages.

pid=$$
# Nothing to do without at least one message file.
if [ -z "$*" ]
then
    exit
fi
for f in "$@"
do
    # Headers: everything up to and including the first empty line.
    cat "$f" | sed -e '/^$/ q' >head.$pid
    # Body: everything after the first empty line.
    cat "$f" | sed -e '1,/^$/ d' >tmp.$pid
    # Escape body lines starting with "From" or with a dot.
    cat tmp.$pid | sed -e 's/^From/>From/' -e 's/^\./ \./' >body.$pid
    # Put the message back together in place.
    cat head.$pid body.$pid >"$f"
done
rm head.$pid tmp.$pid body.$pid

>Hi Folks,
>
>I've been trying to migrate a collection of messages from yahoogroups to
>sympa (which uses mhonarc as its archiving engine).
>
>There's a great little script, yahoo2mbox, that pulls messages from
>yahoogroups and aggregates them into an mbox file - ideal for processing
>by mhonarc.
>
>Unfortunately, when I run mhonarc on the mbox file, it seems to cut out
>the bodies of a lot of, but not all of the messages - leaving the header
>intact. It seems like messages that originated with MS Outlook are
>particularly likely to end up with empty bodies.
>
>Now I've read the archives of this list, and this seems to be a known
>problem with mhonarc filtering out malformed HTML, but I haven't seen
>any recent traffic indicating a solution of any sort.
>
>So... has anybody come up with a straightforward way to clean up an mbox
>file sufficiently for mhonarc to process? (e.g. a way to run the mbox
>file through HTML Tidy or some such)? Or can anybody offer some
>suggestions, recipes, recent experiences, etc.?
>
>Thanks much,
>
>Miles Fidelman
>
>
--
PEG Manager pegmgr at peg dot com
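
For anyone who would rather not juggle temp files, the same escaping can probably be done as a filter. What follows is only a rough sketch, not a tested replacement for the script above: escape_body is a made-up name, it assumes one message per input with headers ending at the first empty line, and it only touches lines starting with "From " (with the trailing space, as in an mbox separator), which is slightly narrower than the sed pattern in the script.

```shell
#!/bin/sh
# Sketch only: escape_body is a hypothetical helper, not part of mhonarc
# or yahoo2mbox. It reads one message on stdin and writes it to stdout.
escape_body() {
    awk '
        # Past the headers, quote lines that look like mbox separators.
        in_body && /^From / { print ">" $0; next }
        # Pad body lines that start with a dot, as the script above does.
        in_body && /^\./    { print " " $0; next }
        # The first empty line ends the headers; the rest is body.
        /^$/                { in_body = 1 }
        { print }
    '
}
```

Usage would be something like

    escape_body <msg >msg.new && mv msg.new msg

for each message file before the messages are joined into the mbox.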
