According to Fabio Barone:
> Another guy having trouble parsing pdf files..
> 
> I read the FAQ stating that when you get messages like
> 
> PDF:cannot find acroread
> 
> there's the default max_doc_size that has to be increased.
> Well, I've done that but I still get the messages:
> 
> ..
> Read 8192 from document
> Read 5391 from document
> Read a total of 169231 bytes
> PDF::setContents(169231 bytes)
> PDF::parse(http://sweweb/~fba/cookbook3.pdf)
> PDF::parse: cannot find acroread
>  size = 169231
> pick: sweweb:80, # servers = 1
> 
> where max_doc_size is 200000
> 
> Could it be that the pdf parser can't resolve the (redirect) URL to the file
> location? (sweweb/~fba/)

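As the error message says, the problem is that htdig can't find acroread.
Your log shows that all 169231 bytes were read, which is under your
max_doc_size of 200000, so neither the size limit nor the URL is at fault.
Do you have Adobe Acrobat Reader installed on your system?  If so, where is
the acroread command?  It is usually in /usr/local/bin, which I think is
where the htdig configuration procedure expects to find it.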
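
One quick way to see where acroread is installed, if at all, is a small
script like the sketch below.  It is only an illustration: the helper name
find_command and the list of extra directories are assumptions, not part of
htdig.  It reports where the command is, which is not necessarily where
htdig was configured to look for it.

```python
import os
import shutil


def find_command(name, extra_dirs=("/usr/local/bin",)):
    """Return the full path to an executable called `name`, or None."""
    # Search the directories on the current PATH first.
    found = shutil.which(name)
    if found:
        return found
    # Then try directories that may not be on PATH (assumed defaults).
    for directory in extra_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    path = find_command("acroread")
    print("acroread:", path if path else "not found")
```

If this prints a directory other than /usr/local/bin, that mismatch is the
likely reason htdig reports that it cannot find acroread.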

By the way, that error message appears only in the old beta releases,
3.1.0b4 and earlier.  The latest release (3.1.1) works a little differently.

If you don't have acroread, you can use an external parser instead.
The latest version of the parse_doc.pl script can serve as an external
parser for files of the application/pdf type.  It uses pdftotext (from
the xpdf 0.80 package) to extract the text from the PDF file, and formats
the text as required by htdig.  If you're going to use external parsers,
you really ought to upgrade to htdig 3.1.1, though, because a lot of fixes
have gone into external parser support recently.  The latest version of
parse_doc.pl is available from:

        http://www.scrc.umanitoba.ca/htdig/rpms/parse_doc.pl

and documentation on external parsers is at:

        http://www.htdig.org/attrs.html#external_parsers
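
As a rough sketch of what the configuration might look like, the
external_parsers attribute in htdig.conf pairs a content type with the
program that parses it.  The install location of parse_doc.pl shown here is
an assumption; use whatever path you actually copied the script to, and
check the documentation above for the exact syntax in your release.

```
# htdig.conf fragment (sketch only; the script path is an assumed location)
external_parsers: application/pdf /usr/local/bin/parse_doc.pl
```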

Hope this helps.  I think the FAQ needs to say a lot more about PDFs than
it does now.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
