>
> Hi,
>
> I want to index PDF files with German umlauts (ä, ö, ü, ß). Some tests have shown me
>that htdig (v. 3.1.5) and xpdf (v. 0.91) work pretty well with German umlauts,
>but the external parser parse_doc.pl has problems with them. It splits a word
>containing an umlaut into two words, dropping the umlaut.
> For example:
>
> w beim 41 0
> w diesj 45 0
> w hrigen 50 0
> w den 58 0
> w Platz 62 0
>
> In this case the German word "diesjährigen" is split into "diesj" and "hrigen", and I
>can find both with htsearch.
>
> Does anyone know how to solve this problem, for example with a modified version of
>parse_doc.pl?
>
> Thanks,
>
> Christian Huhn
>
You could try the doc2html parser. I think that the latest version,
available from the Ht://Dig web site, will not split words this way, but
I have not tested it thoroughly.
If doc2html does not parse your .PDF files properly, then email an
example to me personally, and I'll make sure that the next version of
doc2html works correctly.
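
For illustration only: the output quoted above is what one would expect
if a parser treats only the ASCII letters A-Z and a-z as word characters.
That is an assumption about the cause, not something checked against
parse_doc.pl. The short Python sketch below (all names are hypothetical)
shows the difference between an ASCII-only word pattern and one that
accepts any letter:

```python
import re

# Hypothetical tokenizers, for illustration only; not taken from parse_doc.pl.
ASCII_WORD = re.compile(r"[A-Za-z]+")    # only ASCII letters count as word characters
UNICODE_WORD = re.compile(r"[^\W\d_]+")  # any Unicode letter counts as a word character


def split_words(text, pattern):
    """Return the list of words that the given pattern finds in text."""
    return pattern.findall(text)


# "\u00e4" is a-umlaut, written as an escape so the file encoding does not matter.
sample = "beim diesj\u00e4hrigen den Platz"

ascii_words = split_words(sample, ASCII_WORD)
unicode_words = split_words(sample, UNICODE_WORD)

print(ascii_words)    # the umlaut acts as a word separator here
print(unicode_words)  # the word containing the umlaut stays in one piece
```

With the ASCII-only pattern the umlaut acts as a separator, which gives
the "diesj" / "hrigen" pair seen in the report. If a parser shows this
behaviour, widening its notion of a word character is the kind of change
to look for.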
--
David J Adams
<[EMAIL PROTECTED]>
Computing Services
University of Southampton