Hello David, thank you very much for your support!
Yes, htdig is reading a .pdf file. pdftotext and pdfinfo are working OK, too.
But when I run conv_doc.pl (or doc2html.pl) from the command line with a
PDF file as an argument, I get this error message:

    bad interpreter: no such file or directory/usr/bin/perl

What is wrong here?

Best regards,
Natalya

> OK, so far we have established:
>
> 1) Htdig is reading a .PDF file
> 2) You are attempting to use /usr/local/bin/conv_doc.pl to convert it.
> 3) No text is being extracted from the .PDF file, so it is not being
>    indexed.
>
> This suggests that the fault is with /usr/local/bin/conv_doc.pl. Please try
> executing this from the command line:
>
> /usr/local/bin/conv_doc.pl somepdffile.pdf
>
> where somepdffile.pdf is a PDF file from which it should be able to extract
> text. See what happens.
> This is a necessary step in the diagnosis.
>
> David Adams
> Corporate Information Services
> Information Systems Services
> University of Southampton
>
>
> ----- Original Message -----
> From: "Natalya Kolesnikova" <[EMAIL PROTECTED]>
> To: "Gilles Detillieux" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Sent: Thursday, October 09, 2003 9:51 AM
> Subject: Re: [htdig] PDF-SEARCH
>
>
> > Yes, I get the error message "Deleted: no excerpt"!!!
> >
> > Natalya
> >
> > > According to Natalya Kolesnikova:
> > > > Thank you, David, for your help!
> > > >
> > > > But when I run htmerge, I get the following message:
> > > > htmerge: Document database has no URLs. Check your config file and try
> > > > running htdig again.
> > >
> > > Are there any other htmerge error messages, such as a "Deleted: no excerpt"
> > > message? I suspect what's happening here is that htdig adds the single
> > > URL for the PDF file, which you specify in start_url, to the database,
> > > but when it tries to index it, it finds nothing to index.
> > > When htmerge sees that nothing was indexed for this one document, it
> > > removes it from the database, but then complains that there are no URLs
> > > left in the database.
> > > Seeing all the htmerge error messages (try htmerge -v after htdig) would
> > > give us a more complete picture.
> > >
> > > Please follow through on Dave's and my suggestions below...
> > >
> > > > > Ok, your configuration file contains:
> > > > >
> > > > > external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
> > > > >                   application/postscript->text/html /usr/local/bin/conv_doc.pl \
> > > > >                   application/pdf->text/html /usr/local/bin/conv_doc.pl
> > > > >
> > > > > so you are using conv_doc.pl.
> > > > >
> > > > > Please check one thing in your configuration file: make sure there are no
> > > > > white space characters after the \ characters at the end of lines, this is
> > > > > most important.
> > >
> > > My first hunch is that this isn't the problem, because if htdig didn't
> > > see the full external_parsers definition (all 3 lines of it), it likely
> > > would be trying to use acroread and the PDF:: class, so we'd see messages
> > > from there. However, it's an easy thing to check for, and always a good
> > > idea to pay close attention to in any case, so please do have a look at
> > > these lines.
> > >
> > > > > If your configuration file is OK, then the problem must be with
> > > > > /usr/local/bin/conv_doc.pl or the utilities it calls.
> > > > > Try running /usr/local/bin/conv_doc.pl from the command line with a .PDF
> > > > > file as argument and see what the result is.
> > >
> > > This is a very important test.
> > > Your first test, with the start_url set to
> > > http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
> > > showed that it failed with this single PDF file, which suggests a problem
> > > either with that PDF file or with the setup of the external parser.
> > > The next step is to find out which is at fault, and this test will do
> > > that. If it fails on the introduction_to_IPR.pdf file (i.e. it produces
> > > no output), try it on a few other files as well. If it doesn't work on
> > > any of them, I'd suspect that conv_doc.pl is not properly configured.
> > > In this case, you should try pdftotext directly on these PDF files to
> > > see if that works.
> > >
> > > If it produces output for some PDF files, but not others, it may be that
> > > the ones for which it produces nothing actually contain no indexable text.
> > > Some PDF files contain only image data, including perhaps scanned pages
> > > that display as text, but in fact are only a "picture" of a page.
> > >
> > > Once you can get conv_doc.pl to spit out text when run manually,
> > > the following step will be to try htdig on those same PDF files,
> > > one at a time, using htdig -ivvvv (note: 4 "v" options this time,
> > > so htdig shows each word it parses). If you get that far, then the
> > > next stage would be to use your original start_url to index your whole
> > > site, and see if it will find all the PDF files. If it doesn't, see
> > > http://www.htdig.org/FAQ.html#q5.27
> > >
> > > --
> > > Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> > > Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
> > > Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
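
The check David suggests in the quoted text, that no white space follows the \ characters at the end of the external_parsers lines, is hard to do by eye. A minimal sketch of how it might be automated is below. It is written in Python purely for illustration; the function name find_broken_continuations is hypothetical, and the path to the configuration file is whatever you pass on the command line. It only looks for a backslash followed by spaces, tabs or a carriage return at the end of a line, and makes no other assumptions about the ht://Dig config syntax.

```python
import re
import sys

# A backslash followed by one or more spaces, tabs or carriage returns
# at the very end of a line (hypothetical helper, not part of ht://Dig).
TRAILING_WS_AFTER_BACKSLASH = re.compile(r"\\[ \t\r]+$")


def find_broken_continuations(text):
    """Return 1-based numbers of lines where whitespace follows a final backslash."""
    bad = []
    for number, line in enumerate(text.split("\n"), start=1):
        if TRAILING_WS_AFTER_BACKSLASH.search(line):
            bad.append(number)
    return bad


if __name__ == "__main__" and len(sys.argv) > 1:
    # newline="" keeps any carriage returns so DOS line endings are reported too.
    with open(sys.argv[1], newline="") as handle:
        for number in find_broken_continuations(handle.read()):
            print(f"line {number}: whitespace after trailing backslash")
```

Saved under any name you like, it would be run with the config file as its only argument. No output means no continuation line has trailing white space.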

