[htdig] laola2html.pl

Holmes, Gregory Tue, 03 Jul 2001 10:46:23 -0700


[Apologies for the HTML email: MS Exchange ignores my Outlook client's plain-text setting, which is a known bug.]

I recently asked about using doc2html with word2x for indexing Word documents. It turned out that word2x, at least on my system, would hang on some documents and stop the whole dig dead. This is probably down to my system rather than word2x itself.

I have since switched to the Perl library LAOLA

http://snake.cs.tu-berlin.de:8081/~schwartz/pmh/

which is working like a charm. I modified David Adams's pdf2html.pl to make a laola2html.pl, complete with title, subject, and keyword extraction. It's depressing (and sometimes funny) how few authors set these attributes, by the way. Watch out for "Sample Manual Title" and such ;)
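
For anyone curious about the general shape of that last step, here is a rough sketch (in Python rather than Perl, and not the attached script) of wrapping already-extracted text and properties in HTML for the indexer. The function name build_html, the placeholder list, and the choice of "description" and "keywords" meta tags are all assumptions made for illustration. The LAOLA extraction itself is not shown.

```python
from html import escape

# Template leftovers to ignore; this list is an assumption, extend as needed.
PLACEHOLDER_TITLES = {"sample manual title", "untitled", ""}


def build_html(text, title=None, subject=None, keywords=None,
               fallback_title="Untitled document"):
    """Wrap extracted text and document properties in minimal HTML."""
    # Fall back (e.g. to the file name) when the title is missing or a placeholder.
    if title is None or title.strip().lower() in PLACEHOLDER_TITLES:
        title = fallback_title
    head = ["<title>%s</title>" % escape(title.strip())]
    # Only emit meta tags for properties the author actually set.
    if subject and subject.strip():
        head.append('<meta name="description" content="%s">'
                    % escape(subject.strip()))
    if keywords and keywords.strip():
        head.append('<meta name="keywords" content="%s">'
                    % escape(keywords.strip()))
    # Blank lines in the extracted text separate paragraphs.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    body = "\n".join("<p>%s</p>" % escape(p) for p in paragraphs)
    return ("<html>\n<head>\n%s\n</head>\n<body>\n%s\n</body>\n</html>\n"
            % ("\n".join(head), body))
```

Passing the file name as fallback_title means a document whose title is still "Sample Manual Title" at least shows up in search results under something recognizable.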

Anyway, I attach it here for anyone to use, and I'd welcome any suggestions for improvement. As I said, I basically hacked up pdf2html.pl, so I'm sure this could be optimized and improved.

-Greg Holmes

laola2html.pl
