OK, the PDF search works! Now I am trying to index .ppt and .xls files.

htdig.conf:

external_parsers: application/rtf->text/html /srv/www/htdig/doc2html/doc2html.pl \
    text/rtf->text/html /srv/www/htdig/doc2html/doc2html.pl \
    application/pdf->text/html /srv/www/htdig/doc2html/doc2html.pl \
    application/vnd.ms-excel->text/html /srv/www/htdig/doc2html/doc2html.pl \
    application/vnd.ms-powerpoint->text/html /srv/www/htdig/doc2html/doc2html.pl\
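A definition like the one above can be checked for the problem mentioned further down in this thread (white space after the \ at the end of a line). The following is only a minimal sketch in Python, not part of ht://Dig. The function names are made up for illustration, and the parsing assumes that the value of external_parsers is a list of alternating "type->type" and program tokens. It only approximates how htdig itself reads the file.

```python
import re
import sys


def lines_with_space_after_backslash(config_text):
    """Return 1-based numbers of lines where a trailing backslash
    is followed by spaces or tabs (which breaks the continuation)."""
    bad = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if re.search(r"\\[ \t]+$", line):
            bad.append(number)
    return bad


def parse_external_parsers(config_text):
    """Return (mapping, program) pairs from the external_parsers attribute.

    Assumption: the value is a whitespace-separated list of alternating
    'mime/type->mime/type' and program tokens.
    """
    # Join continuation lines (backslash directly followed by newline).
    logical = re.sub(r"\\\n", " ", config_text)
    for line in logical.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "external_parsers":
            tokens = value.split()
            return list(zip(tokens[0::2], tokens[1::2]))
    return []


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="latin-1") as handle:
        text = handle.read()
    for number in lines_with_space_after_backslash(text):
        print(f"line {number}: white space after the backslash")
    for mapping, program in parse_external_parsers(text):
        print(f"{mapping}  ->  {program}")
```

Saved under a hypothetical name such as check_conf.py, it would be run as "python check_conf.py htdig.conf". If the list of pairs it prints is shorter than the list in the file, one of the continuation lines is probably being cut off.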
ppthtml and xlhtml work fine from the command line. doc2html.pl also works when given a .ppt or .xls file as argument. But when I run rundig, neither the .ppt files nor the .xls files get indexed!

Best regards,
Natalya

> Glad that you have made progress.
>
> I don't recognize the "PRINT OUTPUNT!!!???" message, but to run doc2html.pl
> from the command line it is necessary to give two arguments:
>
>     doc2html.pl filename.pdf application/pdf
>
> If this fails, try:
>
>     pdf2html.pl filename.pdf
>
> David Adams
> Corporate Information Services
> Information Systems Services
> University of Southampton
>
> ----- Original Message -----
> From: "Natalya Kolesnikova" <[EMAIL PROTECTED]>
> To: "Martin Joisten" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Friday, October 10, 2003 12:27 PM
> Subject: Re: [htdig] PDF-SEARCH
>
> > Ok, it runs with conv_doc.pl!!!! Thanks to everyone who helped me!!!!
> >
> > If I run doc2html.pl with a pdf-file as argument from the command line, I get
> > PRINT OUTPUNT!!!???
> >
> > best regards
> > Natalya
> >
> > > Hi Natalya,
> > > then it seems that the path to perl is wrong and that's why the Perl
> > > Script(s) don't work.
> > >
> > > Check out the first lines of each Perl Script (.pl) and correct the path
> > > to perl. Maybe there isn't even perl installed ;-)
> > >
> > > Best wishes,
> > > Martin
> > >
> > > Natalya Kolesnikova wrote:
> > >
> > > > Hello David,
> > > >
> > > > thank you very much for your support!
> > > >
> > > > Yes, htdig is reading a .pdf file. pdftotext and pdfinfo are working ok,
> > > > too.
> > > > But when I run conv_doc.pl (or doc2html.pl) from the command line with a
> > > > pdf-file as an argument I get the error message:
> > > > bad interpreter: no such file or directory/usr/bin/perl.
> > > >
> > > > What is wrong here???
> > > >
> > > > best regards
> > > > Natalya
> > > >
> > > >>OK, so far we have established:
> > > >>
> > > >>1) Htdig is reading a .PDF file
> > > >>2) You are attempting to use /usr/local/bin/conv_doc.pl to convert it.
> > > >>3) No text is being extracted from the .PDF file, so it is not being
> > > >>indexed.
> > > >>
> > > >>This suggests that the fault is with /usr/local/bin/conv_doc.pl. Please try
> > > >>executing this from the command line:
> > > >>
> > > >>    /usr/local/bin/conv_doc.pl somepdffile.pdf
> > > >>
> > > >>where somepdffile.pdf is a PDF file from which it should be able to extract
> > > >>text. See what happens.
> > > >>This is a necessary step in the diagnosis.
> > > >>
> > > >>David Adams
> > > >>Corporate Information Services
> > > >>Information Systems Services
> > > >>University of Southampton
> > > >>
> > > >>----- Original Message -----
> > > >>From: "Natalya Kolesnikova" <[EMAIL PROTECTED]>
> > > >>To: "Gilles Detillieux" <[EMAIL PROTECTED]>
> > > >>Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> > > >>Sent: Thursday, October 09, 2003 9:51 AM
> > > >>Subject: Re: [htdig] PDF-SEARCH
> > > >>
> > > >>>Yes, I get the error message "Deleted: no excerpt"!!!
> > > >>>
> > > >>>Natalya
> > > >>>
> > > >>>>According to Natalya Kolesnikova:
> > > >>>>
> > > >>>>>Thank you, David, for your help!
> > > >>>>>
> > > >>>>>But when I run htmerge, I get the following message:
> > > >>>>>htmerge: Document database has no URLs. Check your config file and try
> > > >>>>>running htdig again.
> > > >>>>
> > > >>>>Are there any other htmerge error messages, such as a "Deleted: no excerpt"
> > > >>>>message? I suspect what's happening here is that htdig adds the single
> > > >>>>URL for the PDF file, which you specify in start_url, to the database,
> > > >>>>but when it tries to index it, it finds nothing to index.
> > > >>>>When htmerge sees that nothing was indexed for this one document, it
> > > >>>>removes it from the database, but then complains that there are no URLs
> > > >>>>left in the database.
> > > >>>>Seeing all the htmerge error messages (try htmerge -v after htdig) would
> > > >>>>give us a more complete picture.
> > > >>>>
> > > >>>>Please follow through on Dave's and my suggestions below...
> > > >>>>
> > > >>>>>>Ok, your configuration file contains:
> > > >>>>>>
> > > >>>>>>external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
> > > >>>>>>    application/postscript->text/html /usr/local/bin/conv_doc.pl \
> > > >>>>>>    application/pdf->text/html /usr/local/bin/conv_doc.pl
> > > >>>>>>
> > > >>>>>>so you are using conv_doc.pl.
> > > >>>>>>
> > > >>>>>>Please check one thing in your configuration file: make sure there are no
> > > >>>>>>white space characters after the \ characters at the end of lines, this is
> > > >>>>>>most important.
> > > >>>>
> > > >>>>My first hunch is that this isn't the problem, because if htdig didn't
> > > >>>>see the full external_parsers definition (all 3 lines of it), it likely
> > > >>>>would be trying to use acroread and the PDF:: class, so we'd see messages
> > > >>>>from there. However, it's an easy thing to check for, and always a good
> > > >>>>idea to pay close attention to in any case, so please do have a look at
> > > >>>>these lines.
> > > >>>>
> > > >>>>>>If your configuration file is OK, then the problem must be with
> > > >>>>>>/usr/local/bin/conv_doc.pl or the utilities it calls.
> > > >>>>>>Try running /usr/local/bin/conv_doc.pl from the command line with a .PDF
> > > >>>>>>file as argument and see what the result is.
> > > >>>>
> > > >>>>This is a very important test. Your first test, with the start_url set to
> > > >>>>
> > > >>>>http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
> > > >>>>
> > > >>>>showed that it failed with this single PDF file, which suggests a problem
> > > >>>>either with that PDF file or with the setup of the external parser.
> > > >>>>The next step is to find out which is at fault, and this test will do
> > > >>>>that. If it fails on the introduction_to_IPR.pdf file (i.e. it produces
> > > >>>>no output), try it on a few other files as well. If it doesn't work on
> > > >>>>any of them, I'd suspect that conv_doc.pl is not properly configured.
> > > >>>>In this case, you should try pdftotext directly on these PDF files to
> > > >>>>see if that works.
> > > >>>>
> > > >>>>If it produces output for some PDF files, but not others, it may be that
> > > >>>>the ones for which it produces nothing actually contain no indexable text.
> > > >>>>Some PDF files contain only image data, including perhaps scanned pages
> > > >>>>that display as text, but in fact are only a "picture" of a page.
> > > >>>>
> > > >>>>Once you can get conv_doc.pl to spit out text when run manually,
> > > >>>>the following step will be to try htdig on those same PDF files,
> > > >>>>one at a time, using htdig -ivvvv (note: 4 "v" options this time,
> > > >>>>so htdig shows each word it parses).
> > > >>>>If you get that far, then the
> > > >>>>next stage would be to use your original start_url to index your whole
> > > >>>>site, and see if it will find all the PDF files. If it doesn't, see
> > > >>>>http://www.htdig.org/FAQ.html#q5.27
> > > >>>>
> > > >>>>--
> > > >>>>Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> > > >>>>Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
> > > >>>>Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
> > >
> > > -------------------------------------------------------
> > > This SF.net email is sponsored by: SF.net Giveback Program.
> > > SourceForge.net hosts over 70,000 Open Source Projects.
> > > See the people who have HELPED US provide better services:
> > > Click here: http://sourceforge.net/supporters.php
> > > _______________________________________________
> > > ht://Dig general mailing list: <[EMAIL PROTECTED]>
> > > ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> > > List information (subscribe/unsubscribe, etc.)
> > > https://lists.sourceforge.net/lists/listinfo/htdig-general
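
Martin's suggestion in this thread (check the first line of each Perl script and correct the path to perl) can be sketched as a small Python helper. This is only an illustration under assumptions: check_shebang is a made-up name, and the carriage-return test covers one common cause of "bad interpreter" errors (a script saved with DOS line endings) that the thread itself does not confirm.

```python
import os
import sys


def check_shebang(script_path):
    """Return a short diagnosis of the first line of a script."""
    with open(script_path, "rb") as handle:
        first_line = handle.readline()
    if not first_line.startswith(b"#!"):
        return "no shebang line"
    if first_line.endswith(b"\r\n"):
        # The shell would look for an interpreter named e.g. "/usr/bin/perl\r".
        return "shebang ends with a carriage return (DOS line endings)"
    parts = first_line[2:].split()
    if not parts:
        return "empty shebang"
    interpreter = parts[0].decode("utf-8", "replace")
    if not os.path.isfile(interpreter):
        return f"interpreter not found: {interpreter}"
    if not os.access(interpreter, os.X_OK):
        return f"interpreter not executable: {interpreter}"
    return "ok"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_shebang.py /srv/www/htdig/doc2html/*.pl
    for path in sys.argv[1:]:
        print(f"{path}: {check_shebang(path)}")
```

If every script reports "ok", the next thing to try is the two-argument call David describes (doc2html.pl filename.pdf application/pdf), using a .ppt or .xls file and its MIME type from htdig.conf.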

