[htdig] Problems with parse_doc.pl and German Umlaute

thch Wed, 25 Oct 2000 05:48:22 -0700

Hi,

I want to index PDF files containing German umlauts (ä, ö, ü, ß). My tests show 
that htdig (v. 3.1.5) and xpdf (v. 0.91) handle German umlauts well, 
but the external parser parse_doc.pl has problems with them. It splits each word 
containing an umlaut into two words and drops the umlaut itself.
For example:

w       beim    41      0
w       diesj   45      0
w       hrigen  50      0
w       den     58      0
w       Platz   62      0

In this case the German word "diesjährigen" is split into "diesj" and "hrigen", and I can 
find both with htsearch.
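
To illustrate what I suspect is happening, here is a minimal sketch in Python. It is 
only a guess at the cause and is not taken from parse_doc.pl: it assumes the parser 
treats every character outside an ASCII-only class as a word separator. The names 
ASCII_WORD, LATIN1_WORD and split_words are hypothetical.

```python
import re

# Hypothetical tokenizers for illustration only; not the actual parse_doc.pl logic.
# ASCII-only class: any umlaut acts as a word separator.
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")
# Same class extended with the ISO-8859-1 letters (skipping the multiplication
# and division signs at U+00D7 and U+00F7), so ä, ö, ü and ß count as word characters.
LATIN1_WORD = re.compile(r"[A-Za-z0-9\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+")


def split_words(text, pattern):
    """Return the words found in text according to the given word pattern."""
    return pattern.findall(text)


sample = "beim diesjährigen den Platz"
ascii_words = split_words(sample, ASCII_WORD)
latin1_words = split_words(sample, LATIN1_WORD)
print(ascii_words)
print(latin1_words)
```

With the ASCII-only class, "diesjährigen" comes out as the two fragments shown above; 
with the extended class it stays one word. If this guess is right, a similar change to 
the word-character set in the parser might be enough.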

Does anyone know how to solve this problem, for example with a modified version of 
parse_doc.pl?

Thanks,

Christian Huhn



------------------------------------
To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  <http://www.htdig.org/mail/menu.html>
FAQ:            <http://www.htdig.org/FAQ.html>
