According to David Adams:
> Your config file looks OK, but check that you don't have a space after any
> of those end-of-line \ characters.
> 
> Have you checked that the /usr/share/htdig/parse_doc.pl script runs OK from
> the command line and does extract text from the .PDF files in question?
> 
> In the long run you should consider switching to an external converter
> rather than parse_doc.pl.
> The doc2html.pl script will provide more diagnostic information, including
> how many characters it has extracted from each document.
...

I second that recommendation.  parse_doc.pl should only be used if you're
stuck with a pre-3.1.4 htdig that doesn't handle external converters like
conv_doc.pl or doc2html.pl.
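
On David's first point, stray whitespace after a continuation backslash is
hard to spot by eye. Here is a minimal Python sketch of one way to look for
it. The function name is made up for illustration and none of this is part
of ht://Dig; it simply reports which lines have a backslash followed by
trailing spaces or tabs.

```python
import re

# A backslash followed by one or more spaces or tabs at the end of a line.
TRAILING_WS_AFTER_BACKSLASH = re.compile(r"\\[ \t]+$")


def find_broken_continuations(text):
    """Return 1-based numbers of lines where whitespace follows a final backslash."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if TRAILING_WS_AFTER_BACKSLASH.search(line):
            bad.append(number)
    return bad


# Illustrative input: the first line has a space after its backslash,
# the second line ends cleanly with the backslash.
sample = (
    "external_parsers: \\ \n"
    "    application/pdf /usr/share/htdig/parse_doc.pl \\\n"
    "    application/msword /usr/share/htdig/parse_doc.pl\n"
)
broken = find_broken_continuations(sample)
```

To check a real file, read your htdig.conf into a string and pass it to
find_broken_continuations(); an empty list means no such lines were found.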

> From: "Thierry FLORAC" <[EMAIL PROTECTED]>
> > I'm currently using ht/dig-3.1.5 to index information stored on a Debian
> > GNU/Linux Apache server.
> > My problem is that I can't index PDF files correctly. The symptoms are
> > as follows when running "rundig -a -v":
> >
> >   ...
> >   26:26:1:http://dsi.onf.fr/docs/rapcarcenac.pdf:  size = 448512
> >   ...
> >   Deleted, no excerpt: 26/http://dsi.onf.fr/docs/rapcarcenac.pdf
> >   ...
> >
> > This error is displayed for every PDF file.
> > What does this message mean?
> >
> > My htdig.conf looks like this:
> >
> >   max_doc_size:           20000000
> >   external_parsers: \
> >                 application/msword /usr/share/htdig/parse_doc.pl \
> >                 application/postscript /usr/share/htdig/parse_doc.pl \
> >                 application/pdf /usr/share/htdig/parse_doc.pl
> >
> > My parse_doc.pl script is configured to parse PDF files with pdftotext,
> > which is installed as part of the xpdf-i package, but ht/dig seems to
> > always use acroread, except when I define a "pdf_parser" option in
> > htdig.conf.

If I recall correctly from previous discussions on the list, Debian
configures Apache to put out "; charset=..." on the Content-Type header,
which confuses 3.1.5's external parser support.  Try the 3.1.6 snapshot
at http://www.htdig.org/files/snapshots/
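
To illustrate the kind of mismatch I mean, here is a rough Python sketch.
It is not ht://Dig's actual code, and the charset value and the helper name
are assumptions for the example. It only shows why a lookup keyed on the
bare MIME type can miss when the server appends a parameter to the
Content-Type header, and how dropping the parameter first avoids that.

```python
def base_media_type(content_type):
    """Drop parameters such as '; charset=...' and normalise the case."""
    return content_type.split(";", 1)[0].strip().lower()


# Hypothetical lookup table mirroring the external_parsers setting quoted above.
PARSERS = {
    "application/msword": "/usr/share/htdig/parse_doc.pl",
    "application/postscript": "/usr/share/htdig/parse_doc.pl",
    "application/pdf": "/usr/share/htdig/parse_doc.pl",
}

# Example header value; the charset shown here is an assumption.
header = "application/pdf; charset=iso-8859-1"

# An exact match on the raw header value finds no parser.
naive = PARSERS.get(header)

# Matching on the bare media type finds the PDF parser.
fixed = PARSERS.get(base_media_type(header))
```

If no parser is matched for a PDF, the raw file is presumably not converted
to text at all, which would fit the "no excerpt" message you are seeing.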

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
