OK, the PDF search works now!
Now I am trying to index .ppt and .xls files:
htdig.conf:
external_parsers:    application/rtf->text/html /srv/www/htdig/doc2html/doc2html.pl \
                     text/rtf->text/html /srv/www/htdig/doc2html/doc2html.pl \
                     application/pdf->text/html /srv/www/htdig/doc2html/doc2html.pl \
                     application/vnd.ms-excel->text/html /srv/www/htdig/doc2html/doc2html.pl \
                     application/vnd.ms-powerpoint->text/html /srv/www/htdig/doc2html/doc2html.pl\

ppthtml and xlhtml work fine from the command line.
doc2html also works fine with a .ppt or .xls file as its argument.

But when I run rundig, neither the .ppt files nor the .xls files get indexed!
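
Earlier in this thread it was suggested to make sure that no white space follows the \ characters at the end of the external_parsers lines. A minimal sketch of that check is below. The function name check_continuations is hypothetical, and the second check (a trailing \ followed by a blank line or the end of the file) is only an assumption about something worth looking at, not confirmed htdig behaviour.

```python
import re

# A backslash followed by spaces, tabs or a carriage return at end of line.
TRAILING_WS_AFTER_BACKSLASH = re.compile(r"\\[ \t\r]+$")


def check_continuations(text):
    """Return (line_number, message) pairs for suspicious continuation lines.

    Two things are reported:
      1. white space after a trailing backslash (the check suggested in
         this thread);
      2. a trailing backslash followed by a blank line or end of file
         (an assumption: it may be harmless, but it is worth a look).
    """
    problems = []
    lines = text.split("\n")
    for number, line in enumerate(lines, start=1):
        if TRAILING_WS_AFTER_BACKSLASH.search(line):
            problems.append((number, "white space after trailing backslash"))
        elif line.endswith("\\"):
            # lines is 0-indexed, so lines[number] is the following line.
            next_line = lines[number] if number < len(lines) else ""
            if next_line.strip() == "":
                problems.append(
                    (number, "continuation followed by blank line or end of file")
                )
    return problems
```

It could be run on the contents of the configuration file, for example check_continuations(open(path).read()), where path is wherever your htdig.conf lives. An empty list means neither pattern was found.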

Best regards,
Natalya

> Glad that you have made progress.
> 
> I don't recognize the "PRINT OUTPUNT!!!???" message, but to run
> doc2html.pl from the command line it is necessary to give two arguments:
> 
>     doc2html.pl  filename.pdf   application/pdf
> 
> If this fails, try:
> 
>     pdf2html.pl filename.pdf
> 
> David Adams
> Corporate Information Services
> Information Systems Services
> University of Southampton
> 
> 
> 
> ----- Original Message -----
> From: "Natalya Kolesnikova" <[EMAIL PROTECTED]>
> To: "Martin Joisten" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Friday, October 10, 2003 12:27 PM
> Subject: Re: [htdig] PDF-SEARCH
> 
> 
> > OK, it runs with conv_doc.pl!!!! Thanks to everyone who helped me!!!!
> >
> > If I run doc2html.pl from the command line with a PDF file as argument,
> > I get PRINT OUTPUNT!!!???
> >
> >
> > best regards
> > Natalya
> >
> > > Hi Natalya,
> > > then it seems that the path to perl is wrong and that's why the Perl
> > > Script(s) don't work.
> > >
> > > Check out the first lines of each Perl Script (.pl) and correct the
> > > path to perl. Maybe there isn't even perl installed  ;-)
> > >
> > > Best wishes,
> > > Martin
> > >
> > >
> > > Natalya Kolesnikova wrote:
> > >
> > > > Hello David,
> > > >
> > > > thank you very much for your support!
> > > >
> > > > Yes, htdig is reading a .pdf file. pdftotext and pdfinfo are working
> > > > OK, too.
> > > > But when I run conv_doc.pl (or doc2html.pl) from the command line with
> > > > a PDF file as an argument, I get this error message:
> > > > bad interpreter: no such file or directory/usr/bin/perl.
> > > >
> > > > What is wrong here???
> > > >
> > > > best regards
> > > > Natalya
> > > >
> > > >
> > > >>OK, so far we have established:
> > > >>
> > > > >>1)    Htdig is reading a .PDF file
> > > > >>2)    You are attempting to use /usr/local/bin/conv_doc.pl to convert it.
> > > > >>3)    No text is being extracted from the .PDF file, so it is not being
> > > > >>      indexed.
> > > > >>
> > > > >>This suggests that the fault is with /usr/local/bin/conv_doc.pl.  Please
> > > > >>try executing this from the command line:
> > > > >>
> > > > >>            /usr/local/bin/conv_doc.pl  somepdffile.pdf
> > > > >>
> > > > >>where somepdffile.pdf is a PDF file from which it should be able to
> > > > >>extract text.  See what happens.
> > > > >>This is a necessary step in the diagnosis.
> > > >>
> > > >>David Adams
> > > >>Corporate Information Services
> > > >>Information Systems Services
> > > >>University of Southampton
> > > >>
> > > >>
> > > >>----- Original Message -----
> > > >>From: "Natalya Kolesnikova" <[EMAIL PROTECTED]>
> > > >>To: "Gilles Detillieux" <[EMAIL PROTECTED]>
> > > >>Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> > > >>Sent: Thursday, October 09, 2003 9:51 AM
> > > >>Subject: Re: [htdig] PDF-SEARCH
> > > >>
> > > >>
> > > >>
> > > > >>>Yes, I get the error message "Deleted: no excerpt"!!!
> > > >>>
> > > >>>Natalya
> > > >>>
> > > >>>
> > > >>>>According to Natalya Kolesnikova:
> > > >>>>
> > > >>>>>Thank you, David, for your help!
> > > >>>>>
> > > > >>>>>But when I run htmerge, I get the following message:
> > > > >>>>>htmerge: Document database has no URLs. Check your config file and try
> > > > >>>>>running htdig again.
> > > >>>>
> > > > >>>>Are there any other htmerge error messages, such as a "Deleted: no
> > > > >>>>excerpt" message?  I suspect what's happening here is that htdig adds
> > > > >>>>the single URL for the PDF file, which you specify in start_url, to the
> > > > >>>>database, but when it tries to index it, it finds nothing to index.
> > > > >>>>When htmerge sees that nothing was indexed for this one document, it
> > > > >>>>removes it from the database, but then complains that there are no URLs
> > > > >>>>left in the database.
> > > > >>>>Seeing all the htmerge error messages (try htmerge -v after htdig) would
> > > > >>>>give us a more complete picture.
> > > >>>>
> > > >>>>Please follow through on Dave's and my suggestions below...
> > > >>>>
> > > >>>>
> > > >>>>>>Ok, your configuration file contains:
> > > >>>>>>
> > > > >>>>>>external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
> > > > >>>>>>              application/postscript->text/html /usr/local/bin/conv_doc.pl \
> > > > >>>>>>              application/pdf->text/html /usr/local/bin/conv_doc.pl
> > > > >>>>>>
> > > > >>>>>>so you are using conv_doc.pl.
> > > >>>>>>
> > > > >>>>>>Please check one thing in your configuration file: make sure there are
> > > > >>>>>>no white space characters after the \ characters at the end of lines,
> > > > >>>>>>this is most important.
> > > >>>>
> > > > >>>>My first hunch is that this isn't the problem, because if htdig didn't
> > > > >>>>see the full external_parsers definition (all 3 lines of it), it likely
> > > > >>>>would be trying to use acroread and the PDF:: class, so we'd see messages
> > > > >>>>from there.  However, it's an easy thing to check for, and always a good
> > > > >>>>idea to pay close attention to in any case, so please do have a look at
> > > > >>>>these lines.
> > > >>>>
> > > >>>>
> > > >>>>>>If your configuration file is OK, then the problem must be with
> > > >>>>>>/usr/local/bin/conv_doc.pl or the utilities it calls.
> > > > >>>>>>Try running /usr/local/bin/conv_doc.pl from the command line with a
> > > > >>>>>>.PDF file as argument and see what the result is.
> > > >>>>
> > > > >>>>This is a very important test.  Your first test, with the start_url set to
> > > > >>>>http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
> > > > >>>>showed that it failed with this single PDF file, which suggests a problem
> > > > >>>>either with that PDF file or with the setup of the external parser.
> > > > >>>>The next step is to find out which is at fault, and this test will do
> > > > >>>>that.  If it fails on the introduction_to_IPR.pdf file (i.e. it produces
> > > > >>>>no output), try it on a few other files as well.  If it doesn't work on
> > > > >>>>any of them, I'd suspect that conv_doc.pl is not properly configured.
> > > > >>>>In this case, you should try pdftotext directly on these PDF files to
> > > > >>>>see if that works.
> > > >>>>
> > > > >>>>If it produces output for some PDF files, but not others, it may be that
> > > > >>>>the ones for which it produces nothing actually contain no indexable text.
> > > > >>>>Some PDF files contain only image data, including perhaps scanned pages
> > > > >>>>that display as text, but in fact are only a "picture" of a page.
> > > >>>>
> > > >>>>Once you can get conv_doc.pl to spit out text when run manually,
> > > >>>>the following step will be to try htdig on those same PDF files,
> > > >>>>one at a time, using htdig -ivvvv (note: 4 "v" options this time,
> > > > >>>>so htdig shows each word it parses).  If you get that far, then the
> > > > >>>>next stage would be to use your original start_url to index your whole
> > > > >>>>site, and see if it will find all the PDF files.  If it doesn't, see
> > > > >>>>http://www.htdig.org/FAQ.html#q5.27
> > > >>>>
> > > >>>>--
> > > > >>>>Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> > > > >>>>Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
> > > > >>>>Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
> > > >>>>
> > > >>
> > > >
> > >
> > >
> > >
> > >
> > > -------------------------------------------------------
> > > This SF.net email is sponsored by: SF.net Giveback Program.
> > > SourceForge.net hosts over 70,000 Open Source Projects.
> > > See the people who have HELPED US provide better services:
> > > Click here: http://sourceforge.net/supporters.php
> > > _______________________________________________
> > > ht://Dig general mailing list: <[EMAIL PROTECTED]>
> > > ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> > > List information (subscribe/unsubscribe, etc.)
> > > https://lists.sourceforge.net/lists/listinfo/htdig-general
> > >
> >
> >
> 
> 
> 
> 


