According to Geoff Hutchison:
> On Tue, 15 Jun 1999, Marian Steinbach wrote:
> >    Is there a universal way to achieve indexing PDF?
> 
> I'll give a fairly short answer, I'm sure others will probably correct me
> if I'm wrong.
> 
> Yes and no.
> 
> Some programs write PDF files as graphics. This, of course, defeats the
> whole purpose of the format, and it makes such files essentially
> impossible to index.
> 
> For the vast majority of PDF files, you'll do very well setting an
> external parser to parse_doc.pl and using xpdf. There has been quite a bit
> of discussion on this point, and I expect a search for xpdf should turn up
> a bunch.

No universal way, but many of us have found that pdftotext (which comes
with xpdf 0.80) is the best tool for the job.  Use it in conjunction with
parse_doc.pl, as described in

        http://www.htdig.org/FAQ.html#q4.9

You can get the script from

        http://www.htdig.org/files/contrib/parsers/

or from the contrib directory in the source for ht://Dig 3.1.2.  The
contrib/parsers/ directory on the web site also includes a couple patches
for xpdf 0.80, to improve its handling of oddball spacing in some PDFs
(xpdf-0.80-deltax.patch), and to add a -rawdump option to pdftotext for
indexing multi-column PDFs (xpdf-0.80-rawdump.patch).
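
For anyone curious what the conversion step boils down to, here is a rough
Python sketch. It is not parse_doc.pl, just an illustration of the idea.
It assumes that pdftotext is on your PATH and that it accepts "-" as the
output file to write to standard output. The function names and the
min_length cutoff are made up for this example, and the tab-separated
"w" record layout (word, location 0-1000, heading level) is my assumption
about what an external parser emits, so check it against the
external_parsers documentation for your ht://Dig version.

```python
import re
import subprocess


def build_pdftotext_command(pdf_path, extra_args=()):
    """Return the argv list for converting pdf_path to text on stdout."""
    # Assumption: "-" as the output file sends the text to standard output.
    return ["pdftotext", *extra_args, pdf_path, "-"]


def pdf_to_text(pdf_path, extra_args=()):
    """Run pdftotext and return the extracted text (needs xpdf installed)."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, extra_args),
        capture_output=True,
        check=True,
    )
    # Decode leniently; the actual output encoding depends on the PDF.
    return result.stdout.decode("latin-1")


def word_records(text, min_length=3):
    """Turn extracted text into hypothetical tab-separated word records."""
    words = [
        w.lower()
        for w in re.findall(r"[A-Za-z0-9]+", text)
        if len(w) >= min_length
    ]
    total = max(len(words), 1)
    # The location is scaled to 0-1000 across the document; heading is 0.
    return [f"w\t{w}\t{i * 1000 // total}\t0" for i, w in enumerate(words)]
```

Something like print("\n".join(word_records(pdf_to_text("manual.pdf"))))
would then dump the records for one file. With the rawdump patch applied,
you could pass ["-rawdump"] as extra_args for multi-column documents.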

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930