Hello, I have HTML-documents which are UTF-8 encoded and contain english and/or german content. I have written my own Analyser and Filter to replace the german umlauts with the commonly used pair of character (�=ue, �=ae, �=oe) to avoid any problems. Still in the HTML-code the german umlauts are shown as a pair of character representing the UTF-8 encoding (I think). As a result the StandardTokenizer is missinterpreting the string and splitting a word with umlaut into 2 tokens which is of no use anymore. Does anyone ahs experience in this case and can help me back on the road?
Peter MH --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
