I have a definite diagnosis; perhaps somebody else can provide a good cure.

Your web server is returning PDF files with the header

Content-Type: application/pdf; charset=iso-8859-1

Htdig recognises this as PDF, but fails to call the external parser and
instead tries to invoke acroread. The external_parsers attribute in the
configuration file is set for "application/pdf" only, not for
"application/pdf; charset=iso-8859-1", so it does not match.

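To illustrate the mismatch, here is a minimal Python sketch. It is not
ht://Dig's own code; the parser table and the doc2html path in it are
hypothetical stand-ins. A lookup on the raw header value misses, while a
lookup on the bare media type, with the charset parameter stripped, hits.

```python
def base_mime_type(content_type: str) -> str:
    """Return the bare media type, dropping parameters such as charset."""
    return content_type.split(";", 1)[0].strip().lower()


# Hypothetical stand-in for an external_parsers table keyed by MIME type.
# The parser path is an assumption made for this illustration only.
external_parsers = {"application/pdf": "/usr/local/bin/doc2html.pl"}

header = "application/pdf; charset=iso-8859-1"

# An exact match on the raw header value fails to find the parser.
raw_match = header in external_parsers

# A match on the bare media type finds it.
bare_match = base_mime_type(header) in external_parsers

print(raw_match, bare_match)
```
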
--
David Adams
Computing Services
Southampton University


----- Original Message -----
From: "Per-Henrik Persson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, July 24, 2001 3:21 PM
Subject: Re: [htdig] DELETED, no excerpt on PDF's


> * David Adams <[EMAIL PROTECTED]> [010724 15:54]:
> > With more information we can make better informed guesses.
> >
> > Do you ONLY get "no excerpt" with PDF files?
>
>
>
> > Are you using doc2html or conv_doc to index other types of document, and
> > are they OK?
>
> I only use doc2html to index pdf-files, don't have any other files that
> I'm interested in indexing.
>
> > Are you doing a simple run of htdig followed by htmerge, or something
> > more complicated, such as merging two or more runs of htdig?
>
> First I run a simple "htdig -v -a -i"... Then I get the usual output
> while indexing. For one pdf-file that is:
>
>
> 207:209:7:http://www.citu.lu.se/cituverkstad/allmant/mjukvara/manualer/flash4_SW.pdf:   size = 2660517
>
> That part seems fine...
>
> then I run "htmerge -vvv -a" and get a lot of output... for the pdf-file
> it is:
>
> 202/http://www.citu.lu.se/cituverkstad/allmant/mjukvara/illustrator.htm
> Deleted, no excerpt:
> 209/http://www.citu.lu.se/cituverkstad/allmant/mjukvara/manualer/flash4_SW.pdf
> 188/http://www.citu.lu.se/cituverkstad/allmant/mjukvara/mediacleaner.html
>
>
> > Have you tried producing a log from doc2html? - This will report on how
> > many bytes of text it has extracted from each file.
>
> No, I haven't tried using logfiles but when I run doc2html manually on
> the pdf-file above I get a _large_ html-file that is totally valid html.
>
> > Do the PDF files contain words which are not in your bad words list?
>
> Yes...
>
> Thanx,
>
> P-H
>
>
> *******************************************************************************
> Per-Henrik Persson                          0703-68 53 86
> [EMAIL PROTECTED]                              http://www.whatever.nu
>
> "Just because something doesn't work, it doesn't mean it can't be used..."
>
> *******************************************************************************
>
> _______________________________________________
> htdig-general mailing list <[EMAIL PROTECTED]>
> To unsubscribe, send a message to <[EMAIL PROTECTED]> with a
> subject of unsubscribe
> FAQ: http://htdig.sourceforge.net/FAQ.html
>


