Title: laola2html.pl

[Apologies for the HTML email: MS Exchange ignores my Outlook client's plain-text settings, which is a known bug.]

I recently asked about using doc2html with word2x for indexing Word documents.  I have since found that word2x, at least on my system, hangs on some documents and stops the whole dig dead.  This is probably a problem with my system rather than with word2x.

I have now switched to the Perl library LAOLA

http://snake.cs.tu-berlin.de:8081/~schwartz/pmh/

which is working like a charm.  I modified David Adams' pdf2html.pl to make a laola2html.pl, complete with title, subject, and keywords extraction.  It's depressing (and sometimes funny) how few authors set these attributes, by the way.  Watch out for "Sample Manual Title" and such ;)

Anyway, I attach it here for anyone to use, and I would welcome any suggestions for improvement.  As I said, I basically hacked up pdf2html.pl, so I'm sure this could be optimized and improved.

-Greg Holmes

 

laola2html.pl
