Re: htdig: ht3.1.0b1 and PDF

Michael J. Long Fri, 11 Sep 1998 17:13:54 -0400
Geoff Hutchison wrote:

[...snip...]

> there was a lot of discussion about
> using other programs to parse PDF files. I don't think anyone has tested
> using other programs,

I have looked at the output from acroread and from xpdf's pdftops, and
they differ slightly.  Sylvain's PDF module relies on acroread-specific
tags (BT and ET) to determine where to start searching for words to
index.  Unfortunately, pdftops does not insert these tags into the
PostScript output.
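
To illustrate the dependency on those tags, here is a minimal Python
sketch (not the actual htdig code, which is C++).  It assumes BT and ET
appear as standalone, whitespace-delimited tokens in the converter
output; the function name text_blocks is made up for illustration.

```python
import re

# Assumption: BT and ET show up as standalone whitespace-delimited tokens.
# This naive pattern would also be fooled by " ET " inside a string literal.
_TEXT_BLOCK = re.compile(r"(?<!\S)BT(?!\S)(.*?)(?<!\S)ET(?!\S)", re.DOTALL)


def text_blocks(ps_source):
    """Return the raw contents of every BT ... ET region in ps_source."""
    return [m.group(1).strip() for m in _TEXT_BLOCK.finditer(ps_source)]
```

A parser built this way returns an empty list for any input that lacks
the BT/ET pair, which matches the behaviour described above for pdftops
output.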

Therefore, the PDF module will not work with pdftops as is.  I have some
theories on how to tweak the PDF module to work with both:
        - convert the PDF to PostScript and use the PostScript module
          to parse it (looking at the way the modules work, I don't
          know if this is possible; I haven't looked at it that much
          though)
        - convert the PDF to text and parse the text
        - improve the parsing capability by stealing code from
          the PostScript module
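
The second option could be sketched roughly as follows, in Python for
brevity rather than in htdig's own C++.  It assumes xpdf's pdftotext is
on the PATH and is called as "pdftotext file.pdf -" so the text goes to
stdout.  The function names and the min_len parameter are hypothetical,
and the Latin-1 decoding is a guess that depends on the pdftotext
version.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext and return its output as a string."""
    # "-" as the output file asks pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, check=True
    )
    # Assumption: Latin-1 output; errors="replace" keeps this from raising.
    return result.stdout.decode("latin-1", errors="replace")


def words(text, min_len=3):
    """Split text into lowercase alphanumeric words of at least min_len."""
    return [
        w.lower()
        for w in re.findall(r"[A-Za-z0-9]+", text)
        if len(w) >= min_len
    ]
```

Something like words(pdf_to_text("manual.pdf")) would then give a word
list to index, with no dependence on BT/ET or any other
converter-specific tags.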

Anyone out there have any nuggets of wisdom you can impart?

> but I figured it would be better to name it
> "pdf_parser" than "acroread" anyway.

Good choice.

[...snip...]

Michael J. Long

-- 
* Michael J. Long * #include "std/disclaimer.h"
*   Summa Four    * Work: [EMAIL PROTECTED]
* Manchester, NH  * Play: [EMAIL PROTECTED]
----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the body of the message.