You could also start with the 'h2mbx.pl' script
http://www.albany.net/~anthonyw/archivedemo/script.txt
http://www.albany.net/~anthonyw/archivedemo/
and modify it to parse your html files.
On 26 Apr 2000, François Pinard wrote:
> Louis Proyect <[EMAIL PROTECTED]> writes:
>
> > Has anybody written a perl script to convert mhonarc msg html to
> > standard Internet RFC mailbox format? I want to add old archives to
> > the mail-archive website, but neglected to save the mailbox data that
> > created them originally.
>
> I made the following script for one particular case, but since MHonArc
> is incredibly configurable, there is little chance for the script to
> work generally. But it might help you at getting started, who knows...
>
> To use it, I called a recursive `wget' on the archives, and from within
> the directory, did `unmhonarc * > ../FOLDER' to produce a single big FOLDER
> containing all the archives. Then, I digested that folder from within Gnus,
> and had fun for a good while, sorting out all the information!
>
> The following script is put in an executable file named `unmhonarc',
> as you guessed already :-).
>
>
> #!/usr/bin/env python
> # Rebuild simple messages from their HTML expression.
>
> import string, sys
>
> def main(*arguments):
>     for file in arguments:
>         sys.stderr.write("Processing %s ...\n" % file)
>         lines = open(file).readlines()
>         # Every message gets the same dummy mbox separator line.
>         sys.stdout.write('From nobody@nowhere Sun Feb 13 06:46:37 2000\n')
>         # Skip ahead to the first header line, marked by '<li>'.
>         for counter in range(len(lines)):
>             if lines[counter][0:4] == '<li>':
>                 break
>         # Copy three header lines, dropping the leading '<li>'.
>         write_clean(lines[counter][4:])
>         counter = counter + 1
>         write_clean(lines[counter][4:])
>         counter = counter + 1
>         write_clean(lines[counter][4:])
>         counter = counter + 1
>         # The archive obscured the address here; as shown, the string has
>         # no %s left for `file' to fill, so the original must have had one.
>         sys.stdout.write('Message-Id: <[EMAIL PROTECTED]>\n' % file)
>         sys.stdout.write('\n')
>         # Skip everything up to and including the opening <PRE> line.
>         while counter < len(lines):
>             if lines[counter] == '<PRE>\n':
>                 break
>             counter = counter + 1
>         counter = counter + 1
>         # Copy the body up to the closing </PRE> line.
>         while counter < len(lines):
>             if lines[counter] == '</PRE>\n':
>                 break
>             write_clean(lines[counter])
>             counter = counter + 1
>         sys.stdout.write('\n')
>         sys.stdout.write('\n')
>
> def write_clean(line):
>     # Decode the HTML entities; '&amp;' goes last to avoid double decoding.
>     line = string.replace(line, '&lt;', '<')
>     line = string.replace(line, '&gt;', '>')
>     line = string.replace(line, '&amp;', '&')
>     sys.stdout.write(line)
>
> if __name__ == '__main__':
>     apply(main, tuple(sys.argv[1:]))
>
> --
> François Pinard http://www.iro.umontreal.ca/~pinard
>
>
>
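For what it's worth, here is a rough sketch of the same idea as the quoted
script, written as a function for a current Python so it can be tried on
one page at a time. It assumes the same page layout the quoted script
expects: three header lines starting at the first `<li>' line, and the body
between a `<PRE>' line and a `</PRE>' line. The names page_to_mbox,
message_id and from_line are made up for this sketch and are not part of
MHonArc or of either script, and your pages may well be laid out
differently.

```python
import html


def page_to_mbox(page_text, message_id,
                 from_line="From nobody@nowhere Sun Feb 13 06:46:37 2000"):
    """Turn one HTML message page into one mbox entry, returned as a string.

    message_id is a hypothetical parameter: the caller supplies whatever
    unique id it wants, since the pages themselves are not assumed to
    carry one.
    """
    lines = page_text.splitlines()

    # Headers: three lines, starting at the first line that begins with <li>.
    start = next(
        (i for i, line in enumerate(lines) if line.startswith("<li>")), None)
    if start is None:
        raise ValueError("no <li> header lines found")
    headers = [html.unescape(line[4:]) for line in lines[start:start + 3]]
    headers.append("Message-Id: <%s>" % message_id)

    # Body: the lines between a line that is exactly <PRE> and the next
    # line that is exactly </PRE>.
    body = []
    inside = False
    for line in lines[start + 3:]:
        if not inside:
            inside = (line == "<PRE>")
        elif line == "</PRE>":
            break
        else:
            body.append(html.unescape(line))

    # Separator line, headers, blank line, body, then a trailing blank line
    # so that entries can simply be concatenated into one folder.
    return "\n".join([from_line] + headers + [""] + body + ["", ""])
```

Calling it once per downloaded page and writing the results one after the
other into a single file should give the same kind of FOLDER that the
`unmhonarc * > ../FOLDER' step produces. Body lines that themselves start
with "From " are not escaped here, so some mail readers may split such a
message in two.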
Regards,
AnthonyW