Terry,

It looks as though someone set up HTML and plain-text versions of
these PDF files, and those are what is really being indexed.  It
also looks like someone ran them through an OCR program (not part
of htdig, as far as I know!)  Take a look at:

http://profiles.nlm.nih.gov/BB/A/A/A/A/
  (html version)
http://profiles.nlm.nih.gov/BB/A/L/A/T/_/
  (directory containing the following 2 files:)
http://profiles.nlm.nih.gov/BB/A/L/A/T/_/bbalat.ocr
  (text version)
http://profiles.nlm.nih.gov/BB/A/A/A/A/_/bbaaaa.pdf
  (PDF version)

My guess is that someone configured htdig to index the .ocr or
HTML files, but rewrote the URLs stored in the database to point to
the PDFs, perhaps by using the "url_part_aliases" attribute.  Interestingly,
the PDF file names (minus the ".pdf" extension) can be derived from
the directory paths.  Pretty clever...
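
To make the naming pattern concrete, here is a minimal Python sketch.
The helper name derive_pdf_url is hypothetical, and the rule it applies
(join the path segments, lowercase them, and look under the "_"
subdirectory) is only inferred from the two example URLs above, so it
may not hold for every document on the site:

```python
from urllib.parse import urlparse


def derive_pdf_url(dir_url):
    """Guess the PDF URL for a document directory URL.

    Assumption: the file stem is the lowercased concatenation of the
    path segments, and the file lives in a "_" subdirectory.
    """
    parsed = urlparse(dir_url)
    # Drop empty segments and the "_" subdirectory if it is already present.
    segments = [s for s in parsed.path.split("/") if s and s != "_"]
    stem = "".join(segments).lower()
    base = f"{parsed.scheme}://{parsed.netloc}/" + "/".join(segments)
    return f"{base}/_/{stem}.pdf"


print(derive_pdf_url("http://profiles.nlm.nih.gov/BB/A/A/A/A/"))
```

For the first URL above this yields the bbaaaa.pdf address, which
matches the PDF link in the list.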

Maybe your htdig config file and/or "run" scripts can provide more
insight...

Terry Luedtke wrote:
> 
> Hello,
> 
> We have a search that returns a PDF file as the best hit. But the PDF
> file is an image, not text, so I don't know how htDig is finding it. I
> have customers who want to know how it does so they can repeat it. We
> are using doc2html and pdftotext 0.91. When I run the file through
> doc2html all I get is gibberish. The search is
>
> http://wwwindex.nlm.nih.gov/cgi/htsearch?config=www_exact;method=or;format=builtin-long;words=what%20would%20like;page=1
> 
> and the PDF file is the first link (bbaaaa.pdf).
> 
> Any explanations on how this file gets indexed? (I'd love to tell them
> htDig has OCR, but they wouldn't believe it.)
> 
> Thanks,
> Terry Luedtke
> National Library of Medicine

-- 
Tim Peterman - Web Master,
IT&P Unix Support Group Technical Lead
Lockheed Martin EIS/NE&SS, Moorestown, NJ

"Computers are incredibly fast, accurate and stupid. Human
beings are incredibly slow, inaccurate and brilliant. Together
they are powerful beyond imagination." - Albert Einstein

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
