Re: [htdig] Newbie question on excerpts from PDFs

Gilles Detillieux Thu, 07 Sep 2000 11:39:45 -0700

According to Sue Moffitt:
> Our site has many PDFs, and on searching with htdig many hits come back with
> rubbish as an excerpt. What these particular PDFs seem to have in common is
> custom embedded fonts. Is there any way of getting around this problem and
> getting readable excerpts?

That depends on how you're indexing your PDFs.  If you're using acroread,
I'd recommend trying doc2html with pdftotext instead.  If you're already
using an external parser or converter, maybe give acroread 3.0 a try instead.

See http://www.htdig.org/FAQ.html#q4.9

Embedded fonts can be a problem, because there's no guarantee they'll use
standard encodings or even standard glyph names, so there may not be any
way of getting intelligible text out of these documents other than with
your own eyeballs.
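
To find out which PDFs are affected before indexing them, one option is to
run each through pdftotext and check whether the result looks like text
at all. The sketch below is only a rough heuristic, not part of htdig or
doc2html: the function names, the 0.6 threshold, and the English-only
"looks like a word" test are all assumptions made for illustration. It also
assumes the pdftotext command is on your PATH.

```python
import re
import subprocess

# A token counts as "word-like" if it is made of letters (optionally joined
# by an apostrophe or hyphen) and contains at least one vowel.
# This is a crude, English-centric assumption.
WORD_RE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")
VOWELS = set("aeiouyAEIOUY")


def wordlike_fraction(text):
    """Return the fraction of whitespace-separated tokens that look like words."""
    tokens = [t.strip(".,;:!?()[]\"'") for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    good = sum(1 for t in tokens if WORD_RE.match(t) and VOWELS & set(t))
    return good / len(tokens)


def looks_readable(text, threshold=0.6):
    """Guess whether extracted text is intelligible (threshold is arbitrary)."""
    return wordlike_fraction(text) >= threshold


def extract_text(pdf_path):
    """Run pdftotext on one file; the "-" argument sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, check=True
    )
    # The output encoding depends on the pdftotext build, so decode leniently.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling looks_readable(extract_text("some.pdf")) on each file gives a list
of documents whose excerpts are likely to be rubbish. A document that fails
the check may still be fine (for example, one that is mostly numbers), so
treat the result as a hint about what to inspect by hand.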

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

