Re: [htdig] PDF-SEARCH

Natalya Kolesnikova Fri, 10 Oct 2003 04:33:44 -0700

OK, it runs with conv_doc.pl! Thanks to everyone who helped me!

When I run doc2html.pl from the command line with a PDF file as the
argument, I now get print output!?



best regards
Natalya
 
> Hi Natalya,
> Then it seems that the path to perl is wrong, and that's why the Perl
> scripts don't work.
> 
> Check the first line of each Perl script (.pl) and correct the path
> to perl. Maybe Perl isn't even installed  ;-)
> 
> Best wishes,
> Martin
> 
> 
> Natalya Kolesnikova wrote:
> 
> > Hello David, 
> > 
> > thank you very much for your support!
> > 
> > Yes, htdig is reading a .pdf file. pdftotext and pdfinfo are working OK,
> > too.
> > But when I run conv_doc.pl (or doc2html.pl) from the command line with a
> > PDF file as an argument, I get the error message:
> > bad interpreter: no such file or directory/usr/bin/perl.
> > 
> > What is wrong here?
> > 
> > best regards
> > Natalya
> > 
> > 
> >>OK, so far we have established:
> >>
> >>1)    Htdig is reading a .PDF file
> >>2)    You are attempting to use /usr/local/bin/conv_doc.pl to convert it.
> >>3)    No text is being extracted from the .PDF file, so it is not being
> >>      indexed.
> >>
> >>This suggests that the fault is with /usr/local/bin/conv_doc.pl.  Please
> >>try executing this from the command line:
> >>
> >>            /usr/local/bin/conv_doc.pl  somepdffile.pdf
> >>
> >>where somepdffile.pdf is a PDF file from which it should be able to
> >>extract text.  See what happens.
> >>This is a necessary step in the diagnosis.
> >>
> >>David Adams
> >>Corporate Information Services
> >>Information Systems Services
> >>University of Southampton
> >>
> >>
> >>----- Original Message ----- 
> >>From: "Natalya Kolesnikova" <[EMAIL PROTECTED]>
> >>To: "Gilles Detillieux" <[EMAIL PROTECTED]>
> >>Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> >>Sent: Thursday, October 09, 2003 9:51 AM
> >>Subject: Re: [htdig] PDF-SEARCH
> >>
> >>
> >>
> >>>Yes, I get the error message "Deleted: no excerpt"!
> >>>
> >>>Natalya
> >>>
> >>>
> >>>>According to Natalya Kolesnikova:
> >>>>
> >>>>>Thank you, David, for your help!
> >>>>>
> >>>>>But when I run htmerge, I get the following message:
> >>>>>htmerge: Document database has no URLs. Check your config file and
> >>>>>try running htdig again.
> >>>>
> >>>>Are there any other htmerge error messages, such as a "Deleted: no
> >>>>excerpt" message?  I suspect what's happening here is that htdig adds
> >>>>the single URL for the PDF file, which you specify in start_url, to the
> >>>>database, but when it tries to index it, it finds nothing to index.
> >>>>When htmerge sees that nothing was indexed for this one document, it
> >>>>removes it from the database, but then complains that there are no URLs
> >>>>left in the database.
> >>>>Seeing all the htmerge error messages (try htmerge -v after htdig)
> >>>>would give us a more complete picture.
> >>>>
> >>>>Please follow through on Dave's and my suggestions below...
> >>>>
> >>>>
> >>>>>>Ok, your configuration file contains:
> >>>>>>
> >>>>>>external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
> >>>>>>              application/postscript->text/html /usr/local/bin/conv_doc.pl \
> >>>>>>              application/pdf->text/html /usr/local/bin/conv_doc.pl
> >>>>>>
> >>>>>>so you are using conv_doc.pl.
> >>>>>>
> >>>>>>Please check one thing in your configuration file: make sure there
> >>>>>>are no white space characters after the \ characters at the end of
> >>>>>>lines, this is most important.
> >>>>
> >>>>My first hunch is that this isn't the problem, because if htdig didn't
> >>>>see the full external_parsers definition (all 3 lines of it), it likely
> >>>>would be trying to use acroread and the PDF:: class, so we'd see
> >>>>messages from there.  However, it's an easy thing to check for, and
> >>>>always a good idea to pay close attention to in any case, so please do
> >>>>have a look at these lines.
> >>>>
> >>>>
> >>>>>>If your configuration file is OK, then the problem must be with
> >>>>>>/usr/local/bin/conv_doc.pl or the utilities it calls.
> >>>>>>Try running /usr/local/bin/conv_doc.pl from the command line with a
> >>>>>>.PDF file as argument and see what the result is.
> >>>>
> >>>>This is a very important test.  Your first test, with the start_url
> >>>>set to
> >>>>http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
> >>>>showed that it failed with this single PDF file, which suggests a
> >>>>problem either with that PDF file or with the setup of the external
> >>>>parser.  The next step is to find out which is at fault, and this test
> >>>>will do that.  If it fails on the introduction_to_IPR.pdf file (i.e. it
> >>>>produces no output), try it on a few other files as well.  If it
> >>>>doesn't work on any of them, I'd suspect that conv_doc.pl is not
> >>>>properly configured.  In this case, you should try pdftotext directly
> >>>>on these PDF files to see if that works.
> >>>>
> >>>>If it produces output for some PDF files, but not others, it may be
> >>>>that the ones for which it produces nothing actually contain no
> >>>>indexable text.  Some PDF files contain only image data, including
> >>>>perhaps scanned pages that display as text, but in fact are only a
> >>>>"picture" of a page.
> >>>>
> >>>>Once you can get conv_doc.pl to spit out text when run manually,
> >>>>the following step will be to try htdig on those same PDF files,
> >>>>one at a time, using htdig -ivvvv (note: 4 "v" options this time,
> >>>>so htdig shows each word it parses).  If you get that far, then the
> >>>>next stage would be to use your original start_url to index your whole
> >>>>site, and see if it will find all the PDF files.  If it doesn't, see
> >>>>http://www.htdig.org/FAQ.html#q5.27
> >>>>
> >>>>-- 
> >>>>Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> >>>>Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
> >>>>Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
> >>>>
> >>
> > 
> 
> 
> 
> 
> -------------------------------------------------------
> This SF.net email is sponsored by: SF.net Giveback Program.
> SourceForge.net hosts over 70,000 Open Source Projects.
> See the people who have HELPED US provide better services:
> Click here: http://sourceforge.net/supporters.php
> _______________________________________________
> ht://Dig general mailing list: <[EMAIL PROTECTED]>
> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-general
> 
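
For anyone who hits the same "bad interpreter" message: Martin's suggestion above (look at the first line of each .pl script and make sure the perl path is right) can be automated. The following is only a rough sketch in Python, not part of ht://Dig; the function name check_shebang is made up for illustration. It also looks for a carriage return at the end of the #! line, which is an assumption on my part and was not raised in the thread, but DOS line endings are a common cause of this error.

```python
import os
import sys


def check_shebang(script_path):
    """Inspect a script's first line.

    Returns (interpreter, problem); problem is None when nothing looks wrong.
    """
    with open(script_path, "rb") as handle:
        first_line = handle.readline()
    if not first_line.startswith(b"#!"):
        return None, "no #! line"
    # A trailing carriage return becomes part of the interpreter name
    # when the kernel tries to run the script.
    has_cr = first_line.rstrip(b"\n").endswith(b"\r")
    parts = first_line[2:].decode("utf-8", "replace").split()
    if not parts:
        return None, "empty #! line"
    interpreter = parts[0]
    if has_cr:
        return interpreter, "carriage return after interpreter path (DOS line endings)"
    if not os.path.isfile(interpreter):
        return interpreter, "interpreter not found"
    if not os.access(interpreter, os.X_OK):
        return interpreter, "interpreter not executable"
    return interpreter, None


if __name__ == "__main__":
    # Usage: pass the script paths to check on the command line.
    for path in sys.argv[1:]:
        interpreter, problem = check_shebang(path)
        print(path, interpreter, problem or "looks OK")
```

Run it with the paths of conv_doc.pl and doc2html.pl as arguments. If it reports "interpreter not found", either correct the #! line to point at wherever perl really lives on your system, or install Perl.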

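David's other check above (no white space after the \ at the end of a continued line in the configuration file) is hard to do by eye, since trailing spaces are invisible in most editors. Here is a small sketch in Python that reports the offending lines. The function name find_broken_continuations is hypothetical, and the sample text simply reuses the external_parsers definition quoted in the thread with a space deliberately added after the second backslash.

```python
import re

# A backslash followed by one or more spaces or tabs at the very end of a line.
TRAILING_WS_AFTER_BACKSLASH = re.compile(r"\\[ \t]+$")


def find_broken_continuations(config_text):
    """Return 1-based numbers of lines where white space follows a trailing backslash."""
    bad = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if TRAILING_WS_AFTER_BACKSLASH.search(line):
            bad.append(number)
    return bad


sample = (
    "external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \\\n"
    "              application/postscript->text/html /usr/local/bin/conv_doc.pl \\ \n"
    "              application/pdf->text/html /usr/local/bin/conv_doc.pl\n"
)

# The second line has a space after the backslash, so this prints [2].
print(find_broken_continuations(sample))
```

To check a real file, read its contents into a string and pass that to the function; an empty list means no continuation line has trailing white space.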