According to Andoni Ayala:
> Works fine the search of accented words in html files, but, in .pdf
> files not work fine.
>
> Example:
>
> in .pdf file: petición
>
> but when i run "rundig" it save in db.wordlist the word "petici"
>
> ¿where are my mistake?

HTML files generally use ISO-8859-1 (Latin 1) encoding, or SGML
entities which htdig maps to Latin 1, for accented characters.
PDF documents may use any of a number of different encodings.

When you use acroread to parse PDF files, it makes no attempt to remap
these encodings, so accented characters won't come through correctly
unless the document happens to encode everything in Latin 1.

When you use pdftotext (from conv_doc.pl or one of the other external
converters or parsers), it will attempt to remap the various encodings
to ISO-8859-1, so it's likely to work better, but I don't know whether
it will always do this correctly. I think as long as the embedded fonts
use standard glyph names, it should work, but I'm not completely
certain about how xpdf/pdftotext work internally.
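
To illustrate the effect, here is a rough Python sketch (not htdig's
actual code) of what can happen when text in another encoding is read
as if it were Latin 1. The function names, the word-character set, and
the choice of MacRoman as the "other" encoding are all assumptions made
for this example; a real PDF may use something else entirely.

```python
import re

# Assumed word characters for this sketch: ASCII letters and digits plus
# the accented letters of Latin 1. The real indexer's rules may differ.
LATIN1_WORD = re.compile(r"[A-Za-z0-9\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+")


def words_as_latin1(raw):
    """Split raw bytes into words, treating every byte as Latin 1."""
    text = raw.decode("latin-1")
    return LATIN1_WORD.findall(text)


def remap_to_latin1(raw, source_encoding):
    """Re-encode bytes from a known source encoding into Latin 1."""
    return raw.decode(source_encoding).encode("latin-1", errors="replace")


# "petición" encoded as MacRoman, standing in for a non-Latin-1 PDF encoding.
mac_bytes = "petici\u00f3n".encode("mac_roman")

# Without remapping, the accented byte is not a Latin 1 letter, so the
# word is cut in two at that point.
print(words_as_latin1(mac_bytes))

# With remapping first, the word survives intact.
print(words_as_latin1(remap_to_latin1(mac_bytes, "mac_roman")))
```

In this sketch the unmapped input yields "petici" and "n" as separate
words, while the remapped input yields the whole word. That is
consistent with "petici" ending up in the word list, though the actual
encoding in your PDF file may be a different one.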
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930