Re: [htdig] PDF-SEARCH

David Adams Thu, 09 Oct 2003 05:37:15 -0700

OK, so far we have established:

1)    Htdig is reading a .PDF file
2)    You are attempting to use /usr/local/bin/conv_doc.pl to convert it.
3)    No text is being extracted from the .PDF file, so it is not being
indexed.


This suggests that the fault is with /usr/local/bin/conv_doc.pl.  Please try
executing this from the command line:

            /usr/local/bin/conv_doc.pl  somepdffile.pdf

where somepdffile.pdf is a PDF file from which it should be able to extract
text.  Note whether it prints the extracted text, prints an error message,
or prints nothing at all.
This is a necessary step in the diagnosis.
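
A minimal sketch of that check, assuming a POSIX shell.  The function name
produces_text is made up for illustration and is not part of ht://Dig; it
only reports whether a command wrote any non-whitespace text to standard
output, and leaves error messages visible on the terminal:

```shell
# Hypothetical helper: succeeds if the given command prints any
# non-whitespace text on standard output, fails otherwise.
# Standard error is left alone so that error messages stay visible.
produces_text() {
    # Run the command with its arguments and strip all whitespace
    # from what it printed; anything left over counts as text.
    out=$("$@" | tr -d '[:space:]')
    [ -n "$out" ]
}

# Example use (the file name is a placeholder; adjust to your system):
#   if produces_text /usr/local/bin/conv_doc.pl somepdffile.pdf; then
#       echo "converter produced text"
#   else
#       echo "converter produced no text"
#   fi
#
# The same check can be pointed at pdftotext, which writes to standard
# output when the output file is given as "-":
#   produces_text pdftotext somepdffile.pdf -
```

If the converter produces no text for a file that clearly contains some,
the problem lies in the converter or the utilities it calls, not in htdig.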

David Adams
Corporate Information Services
Information Systems Services
University of Southampton


----- Original Message ----- 
From: "Natalya Kolesnikova" <[EMAIL PROTECTED]>
To: "Gilles Detillieux" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Thursday, October 09, 2003 9:51 AM
Subject: Re: [htdig] PDF-SEARCH


> Yes, I get the error message "Deleted: no excerpt"!!!
>
> Natalya
>
> > According to Natalya Kolesnikova:
> > > Thank you, David, for your help!
> > >
> > > But when I run htmerge, I get the following message:
> > > htmerge: Document database has no URLs. Check your config file and try
> > > running htdig again.
> >
> > Are there any other htmerge error messages, such as a "Deleted: no excerpt"
> > message?  I suspect what's happening here is that htdig adds the single
> > URL for the PDF file, which you specify in start_url, to the database,
> > but when it tries to index it, it finds nothing to index.  When htmerge
> > sees that nothing was indexed for this one document, it removes it from
> > the database, but then complains that there are no URLs left in the
> > database.
> > Seeing all the htmerge error messages (try htmerge -v after htdig) would
> > give us a more complete picture.
> >
> > Please follow through on Dave's and my suggestions below...
> >
> > > > Ok, your configuration file contains:
> > > >
> > > > external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
> > > >               application/postscript->text/html /usr/local/bin/conv_doc.pl \
> > > >               application/pdf->text/html /usr/local/bin/conv_doc.pl
> > > >
> > > > so you are using conv_doc.pl.
> > > >
> > > > Please check one thing in your configuration file: make sure there are no
> > > > white space characters after the \ characters at the end of lines,
> > > > this is most important.
> >
> > My first hunch is that this isn't the problem, because if htdig didn't
> > see the full external_parsers definition (all 3 lines of it), it likely
> > would be trying to use acroread and the PDF:: class, so we'd see messages
> > from there.  However, it's an easy thing to check for, and always a good
> > idea to pay close attention to in any case, so please do have a look at
> > these lines.
> >
> > > > If your configuration file is OK, then the problem must be with
> > > > /usr/local/bin/conv_doc.pl or the utilities it calls.
> > > > Try running /usr/local/bin/conv_doc.pl from the command line with a .PDF
> > > > file as argument and see what the result is.
> >
> > This is a very important test.  Your first test, with the start_url set to
> > http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
> > showed that it failed with this single PDF file, which suggests a problem
> > either with that PDF file or with the setup of the external parser.
> > The next step is to find out which is at fault, and this test will do
> > that.  If it fails on the introduction_to_IPR.pdf file (i.e. it produces
> > no output), try it on a few other files as well.  If it doesn't work on
> > any of them, I'd suspect that conv_doc.pl is not properly configured.
> > In this case, you should try pdftotext directly on these PDF files to
> > see if that works.
> >
> > If it produces output for some PDF files, but not others, it may be that
> > the ones for which it produces nothing actually contain no indexable text.
> > Some PDF files contain only image data, including perhaps scanned pages
> > that display as text, but in fact are only a "picture" of a page.
> >
> > Once you can get conv_doc.pl to spit out text when run manually,
> > the following step will be to try htdig on those same PDF files,
> > one at a time, using htdig -ivvvv (note: 4 "v" options this time,
> > so htdig shows each word it parses).  If you get that far, then the
> > next stage would be to use your original start_url to index your whole
> > site, and see if it will find all the PDF files.  If it doesn't, see
> > http://www.htdig.org/FAQ.html#q5.27
> >
> > -- 
> > Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> > Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
> > Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
> >
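
The quoted advice about white space after the \ continuation characters can
be checked mechanically.  The following is only a sketch, assuming a POSIX
shell and grep; the function name find_broken_continuations is invented for
illustration, and the configuration file path in the usage comment is a
placeholder, not a known location:

```shell
# Hypothetical helper: print, with line numbers, every line of the
# given file on which a backslash is followed by trailing whitespace.
# Exit status is 0 if at least one such line is found, 1 if none.
find_broken_continuations() {
    # '\\' matches a literal backslash; the bracket expressions match
    # one or more whitespace characters up to the end of the line.
    grep -n '\\[[:space:]][[:space:]]*$' "$1"
}

# Example use (path is a placeholder; use your own configuration file):
#   find_broken_continuations /path/to/htdig.conf
```

Any line this prints is a continuation line whose \ is not the last
character, which is the condition the advice warns about.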



