According to David Adams:
> Your config. file looks OK, but check that you don't have a space after any
> of those end-of-line \ characters.
>
> Have you checked that the /usr/share/htdig/parse_doc.pl script runs OK from
> the command line and does extract text from the .PDF files in question?
>
> In the long run you should consider changing to use an external converter,
> rather than parse_doc.pl.
> The doc2html.pl script will provide more diagnostic information, including
> how many characters it has extracted from each document.
...
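
[A rough way to run the trailing-space check suggested in the quoted advice is sketched here in Python. It is an illustration only, not part of ht://Dig: the function name `find_broken_continuations` and the sample text are made up for this example, and it assumes that a backslash followed by spaces or tabs no longer counts as a line continuation.]

```python
import re

# A backslash only continues the line if it is the very last character.
# This pattern matches a backslash followed by trailing spaces or tabs.
BROKEN_CONTINUATION = re.compile(r"\\[ \t]+$")


def find_broken_continuations(text):
    """Return 1-based numbers of lines where whitespace follows a trailing backslash."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if BROKEN_CONTINUATION.search(line)
    ]


# Hypothetical sample: the first line has a stray space after the backslash.
sample = (
    "external_parsers: \\ \n"
    "    application/msword /usr/share/htdig/parse_doc.pl \\\n"
    "    application/pdf /usr/share/htdig/parse_doc.pl\n"
)

for number in find_broken_continuations(sample):
    print(f"line {number}: whitespace after trailing backslash")
```

[To check a real file, read its contents into a string and pass that to `find_broken_continuations`; the location of htdig.conf varies by installation.]
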
I second that recommendation. parse_doc.pl should only be used if you're
stuck with a pre-3.1.4 htdig that doesn't handle external converters like
conv_doc.pl or doc2html.pl.

> From: "Thierry FLORAC" <[EMAIL PROTECTED]>
> > I'm currently using ht/dig-3.1.5 to index information stored on a Debian
> > GNU/Linux Apache server.
> > My problem(s?) is that I can't index PDF files correctly. The symptoms are
> > as follows when running "rundig -a -v":
> >
> > ...
> > 26:26:1:http://dsi.onf.fr/docs/rapcarcenac.pdf: size = 448512
> > ...
> > Deleted, no excerpt: 26/http://dsi.onf.fr/docs/rapcarcenac.pdf
> > ...
> >
> > This error is displayed for every PDF file.
> > What does this message mean?
> >
> > My htdig.conf looks like this:
> >
> > max_doc_size: 20000000
> > external_parsers: \
> > application/msword /usr/share/htdig/parse_doc.pl \
> > application/postscript /usr/share/htdig/parse_doc.pl \
> > application/pdf /usr/share/htdig/parse_doc.pl
> >
> > My parse_doc.pl script is configured to parse PDF files with pdftotext,
> > which is installed as part of the xpdf-i package, but ht/dig seems to
> > always use acroread, except when I define a "pdf_parser" option in
> > htdig.conf.

If I recall correctly from previous discussions on the list, Debian
configures Apache to put out "; charset=..." on the Content-Type header,
which confuses 3.1.5's external parser support. Try the 3.1.6 snapshot
at http://www.htdig.org/files/snapshots/

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
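
[To illustrate the Content-Type problem described above, here is a small Python sketch of one plausible way a "; charset=..." parameter can defeat a parser lookup keyed on the bare MIME type. This is not ht://Dig's actual code; the `PARSERS` table, the function names, and the charset value are assumptions made for the example.]

```python
# Hypothetical table mirroring the external_parsers entries quoted in the post.
PARSERS = {
    "application/msword": "/usr/share/htdig/parse_doc.pl",
    "application/postscript": "/usr/share/htdig/parse_doc.pl",
    "application/pdf": "/usr/share/htdig/parse_doc.pl",
}


def media_type(content_type):
    """Return the bare, lowercased media type from a Content-Type header value."""
    return content_type.split(";", 1)[0].strip().lower()


def lookup_exact(content_type):
    # Naive lookup: any parameter on the header value makes the match fail.
    return PARSERS.get(content_type)


def lookup_normalized(content_type):
    # Strip parameters first, so a "; charset=..." suffix no longer matters.
    return PARSERS.get(media_type(content_type))


# Assumed header value; the actual charset sent by the server may differ.
header = "application/pdf; charset=iso-8859-1"
print(lookup_exact(header))
print(lookup_normalized(header))
```

[With the header value assumed here, the exact lookup finds no parser while the normalized lookup does, which is consistent with PDF files being fetched but never handed to the external parser.]
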

