Recent versions of doc2html extract the Subject, Title, and Keywords from
PDF files that have them, but your document has none of these.

Most likely you give a relatively high weight to the "Description" (the
text of links that point to a document), and you have indexed a page
containing a link like:

<A href="http://profiles.nlm.nih.gov/BB/A/A/A/A/_/bbaaaa.pdf">What I would
Like</a>
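
To illustrate the idea, here is a rough sketch in Python. It is not
ht://Dig's actual code, and the class name LinkTextCollector is made up
for the example. It only shows how the words in a link's text can be
credited to the document the link points at, which is how an image-only
PDF can still match a query:

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Hypothetical helper: collect link text keyed by the link's target URL."""

    def __init__(self):
        super().__init__()
        self.descriptions = defaultdict(list)  # target URL -> list of link texts
        self._href = None    # href of the <a> element currently open, if any
        self._chunks = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <A ...> arrives here as "a".
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse line breaks and repeated spaces inside the link text.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.descriptions[self._href].append(text)
            self._href = None


page = '''<A href="http://profiles.nlm.nih.gov/BB/A/A/A/A/_/bbaaaa.pdf">What I would
Like</a>'''

collector = LinkTextCollector()
collector.feed(page)
collector.close()

pdf_url = "http://profiles.nlm.nih.gov/BB/A/A/A/A/_/bbaaaa.pdf"
link_words = {w.lower() for d in collector.descriptions[pdf_url] for w in d.split()}
query_words = {"what", "would", "like"}  # the words from the search in question
matched = sorted(query_words & link_words)

print(collector.descriptions[pdf_url])
print(matched)
```

All three search words appear in the link text, so the PDF can rank
first even though nothing was extracted from the file itself.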

--
David Adams
Computing Services
Southampton University


----- Original Message -----
From: "Terry Luedtke" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, April 18, 2001 9:49 PM
Subject: [htdig] How does it find this pdf file?


>
> Hello,
>
> We have a search that returns a PDF file as the best hit. But the PDF file
> is an image, not text, so I don't know how htDig is finding it. I have
> customers who want to know how it does so they can repeat it. We are using
> doc2html and pdftotext 0.91. When I run the file through doc2html all I get
> is gibberish. The search is
>
>
> http://wwwindex.nlm.nih.gov/cgi/htsearch?config=www_exact;method=or;format=builtin-long;words=what%20would%20like;page=1
>
> and the PDF file is the first link (bbaaaa.pdf).
>
> Any explanations on how this file gets indexed? (I'd love to tell them
> htDig has OCR, but they wouldn't believe it.)
>
> Thanks,
> Terry Luedtke
> National Library of Medicine
>
>
> _______________________________________________
> htdig-general mailing list <[EMAIL PROTECTED]>
> To unsubscribe, send a message to
> <[EMAIL PROTECTED]> with a subject of unsubscribe
> FAQ: http://htdig.sourceforge.net/FAQ.html
>

