[apology for HTML email - MS Exchange ignores my Outlook client's plain text settings - known bug]
I recently asked about using doc2html with word2x for indexing Word documents. I have since found that word2x, at least on my system, hangs on some documents and stops the whole dig dead. This is probably a problem with my system rather than with word2x itself.
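For anyone who runs into the same hang and wants to keep their current converter, one possible workaround is to run it under a time limit so a stuck document is skipped instead of stalling the dig. The following is only a sketch in Python, not part of doc2html or word2x; the function name `convert_with_timeout` and the 60-second default are assumptions, and the converter command line is passed in by the caller rather than guessed here.

```python
import subprocess

def convert_with_timeout(command, timeout_seconds=60):
    """Run an external converter; return its stdout bytes, or None on hang or failure.

    `command` is the full argument list for whatever converter is in use
    (hypothetical here; no particular converter options are assumed).
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return None
    except OSError:
        # The converter could not be started at all.
        return None
    if result.returncode != 0:
        return None
    return result.stdout
```

A wrapper script could emit an empty page whenever this returns `None`, so that one bad document costs at most the timeout.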
So I have switched to the Perl library LAOLA
http://snake.cs.tu-berlin.de:8081/~schwartz/pmh/
which is working like a charm. I modified David Adams's pdf2html.pl to make a laola2html.pl, complete with title, subject, and keywords extraction. It's depressing (and sometimes funny) how few authors set these attributes, by the way. Watch out for "Sample Manual Title" and such ;)
Anyway, I attach it here for anyone to use, and I would welcome any suggestions for improvement. As I said, I basically hacked up pdf2html.pl, so I'm sure it could be optimized and improved.
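To illustrate the general idea of turning extracted document properties into something an indexer can read, here is a rough sketch in Python. It is not the attached Perl script and does not use LAOLA; the function name `build_html`, the placeholder list, and the choice to map the subject to a `description` meta tag are all assumptions made for the example.

```python
from html import escape

# Hypothetical set of template placeholder titles to ignore (lowercase).
PLACEHOLDER_TITLES = {"sample manual title"}

def build_html(body_text, title="", subject="", keywords=""):
    """Wrap extracted text and document properties in a minimal HTML page."""
    head = []
    clean_title = title.strip()
    # Skip empty titles and known template placeholders.
    if clean_title and clean_title.lower() not in PLACEHOLDER_TITLES:
        head.append("<title>%s</title>" % escape(clean_title))
    if subject.strip():
        # Assumption: the subject is exposed as a description meta tag.
        head.append('<meta name="description" content="%s">'
                    % escape(subject.strip(), quote=True))
    if keywords.strip():
        head.append('<meta name="keywords" content="%s">'
                    % escape(keywords.strip(), quote=True))
    return ("<html>\n<head>\n%s\n</head>\n<body>\n<pre>%s</pre>\n</body>\n</html>\n"
            % ("\n".join(head), escape(body_text)))
```

Called with a title of "Sample Manual Title", this produces a page with no title element at all, which may be preferable to indexing many documents under the same template title.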
-Greg Holmes
laola2html.pl
