Problem tokenizing UTF-8 with German umlauts

PEP AD Server Administrator Wed, 19 May 2004 08:37:49 -0700

Hello,
I have HTML documents which are UTF-8 encoded and contain English and/or
German content. I have written my own Analyzer and Filter to replace the
German umlauts with the commonly used character pairs (ü=ue, ä=ae, ö=oe)
to avoid any problems. However, in the HTML source each German umlaut still
shows up as a pair of characters, which I think are the two bytes of its
UTF-8 encoding. As a result the StandardTokenizer misinterprets the string
and splits a word containing an umlaut into 2 tokens, which makes the
tokens useless.
Does anyone have experience with this and can help me get back on track?


Peter MH
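
The symptom described above can be reproduced without Lucene. The sketch
below is only an illustration under one assumption that the post does not
confirm: that the UTF-8 bytes of the HTML file are decoded with a
single-byte charset such as ISO-8859-1 before they reach the tokenizer. The
class name UmlautDecodingSketch and the helpers decodeAs and replaceUmlauts
are hypothetical names made up for this example; they are not Lucene API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UmlautDecodingSketch {

    // Hypothetical helper: decode raw bytes with an explicitly chosen charset.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Hypothetical helper: the replacement described in the post,
    // covering only the three lowercase umlauts it mentions.
    static String replaceUmlauts(String text) {
        return text
                .replace("\u00fc", "ue")   // u-umlaut
                .replace("\u00e4", "ae")   // a-umlaut
                .replace("\u00f6", "oe");  // o-umlaut
    }

    public static void main(String[] args) {
        // "Mueller" spelled with a u-umlaut, as it would be stored in a UTF-8 file.
        byte[] utf8Bytes = "M\u00fcller".getBytes(StandardCharsets.UTF_8);

        // Assumed failure mode: the bytes are decoded with the wrong charset,
        // so the two-byte umlaut turns into two separate characters.
        String wrong = decodeAs(utf8Bytes, StandardCharsets.ISO_8859_1);
        // Decoding with the charset the file was written in keeps one character.
        String right = decodeAs(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println(wrong.length()); // 7 characters
        System.out.println(right.length()); // 6 characters

        // The second character of the wrongly decoded pair (U+00BC) is not a
        // letter, so a letter-based tokenizer would likely split the word there.
        System.out.println(Character.isLetter(wrong.charAt(2)));

        // The replacement only matches when the text was decoded correctly.
        System.out.println(replaceUmlauts(right)); // Mueller
        System.out.println(replaceUmlauts(wrong).equals(wrong)); // true
    }
}
```

If this assumption holds, the umlaut filter never sees a real umlaut, which
would explain why it has no effect. Opening the file with an explicit
charset, for example new InputStreamReader(inputStream,
StandardCharsets.UTF_8), and passing that Reader to the analyzer would be
the first thing to check.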

