Geoff Hutchison wrote:
[...snip...]
> there was a lot of discussion about
> using other programs to parse PDF files. I don't think anyone has tested
> using other programs,
I have looked at the output from acroread and from xpdf's version of
pdftops and they differ slightly. Sylvain's PDF module uses
acroread-specific tags (BT and ET) to determine where to start searching for
words to index. Unfortunately, pdftops does not insert these tags into
the PostScript output.
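
To make the dependency concrete, here is a rough Python sketch of that kind of BT/ET scan. It is not Sylvain's actual code: the function name, the regular expressions, and both sample inputs are made up for illustration, and it assumes BT and ET show up as standalone tokens around parenthesised strings.

```python
import re

# Assumption: text blocks are delimited by standalone BT ... ET tokens.
BLOCK_RE = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)
# Parenthesised string literals, allowing backslash escapes inside them.
STRING_RE = re.compile(r"\(((?:[^()\\]|\\.)*)\)")


def words_between_bt_et(ps_source):
    """Collect words from string literals found inside BT ... ET blocks."""
    words = []
    for block in BLOCK_RE.findall(ps_source):
        for literal in STRING_RE.findall(block):
            words.extend(re.findall(r"[A-Za-z0-9]+", literal))
    return words


# Hypothetical samples: one with the tags, one without.
with_tags = "BT\n(Hello world) Tj\nET\n"
without_tags = "(Hello world) show\n"

print(words_between_bt_et(with_tags))
print(words_between_bt_et(without_tags))
```

With the tags present the scan finds the words; without them it finds nothing at all, even though the text is there.
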
Therefore, the PDF module will not work with pdftops as is. I have some
theories on how to tweak the PDF module to work with both:
- convert the PDF to PostScript and use the PostScript module
to parse it (looking at the way the modules work, I don't
know if this is possible; I haven't looked at it that much
though)
- convert the PDF to text and parse the text
- improve the parsing capability by borrowing code from
the PostScript module
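
The second option is probably the simplest to try. Below is a minimal Python sketch of it, assuming xpdf's pdftotext is installed and on the PATH (a "-" output name sends the text to stdout). The helper names, the UTF-8 decoding, and the minimum word length of 3 are my own placeholders, not anything taken from ht://Dig.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext on a PDF and return its text output.

    Assumption: the pdftotext binary is available on PATH.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" writes the text to stdout
        capture_output=True,
        check=True,
    )
    # Assumption: output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


def extract_words(text, min_length=3):
    """Split plain text into lowercase words worth indexing."""
    return [
        word.lower()
        for word in re.findall(r"[A-Za-z0-9]+", text)
        if len(word) >= min_length
    ]


# Usage (hypothetical file name):
#   words = extract_words(pdf_to_text("manual.pdf"))
```

The point of going through plain text is that the word extraction no longer cares which converter produced the output, so the BT/ET difference stops mattering.
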
Does anyone out there have any nuggets of wisdom to impart?
> but I figured it would be better to name it
> "pdf_parser" than "acroread" anyway.
Good choice.
[...snip...]
Michael J. Long
--
* Michael J. Long * #include "std/disclaimer.h"
* Summa Four * Work: [EMAIL PROTECTED]
* Manchester, NH * Play: [EMAIL PROTECTED]