I've spent quite a bit of time over the last year trying to implement RFC 3454 (Preparation of Internationalized Strings, aka 'StringPrep'). This RFC is also a dependency of RFC 3491 (Nameprep, the stringprep profile used by Internationalized Domain Names / IDNA), which is something that I also need to support.
The problem that I've been struggling with in .NET is Unicode code points above 0xFFFF. In a .NET string these are encoded in UTF-16 as surrogate pairs, the scheme defined in section 3.7 of the Unicode spec (http://www.unicode.org/book/ch03.pdf). Related to surrogate pairs is the whole set of Unicode combining characters.

The problem, then, is this: when I iterate over a string using the .NET StringInfo class, I get a set of graphemes. These graphemes correctly handle the combining characters and surrogate pairs, and end up giving me a single UTF-32 code point for each grapheme.

BUT, let's say the original string had U+10FF1 encoded as a UTF-16 surrogate pair. This character is illegal in a particular stringprep profile. The original string also had a combining character sequence, U+0301 + U+0302 (for example), and the grapheme that the StringInfo class reports for this is also U+10FF1.

My difficulty is that each of the combining characters IS legal in the stringprep profile, but I have no way of telling whether the original data was the (illegal) UTF-32 code point or the (legal) combining characters.

Has anyone implemented any of this stuff in .NET?

--
Chris Mullins
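
One way to keep the two cases apart is to look at the UTF-16 code units directly rather than at text elements: a surrogate pair decodes to one code point above U+FFFF, while combining marks such as U+0301 stay separate code points below that. The following is only a minimal sketch under stated assumptions. It is written in Java purely because Java strings are UTF-16 like .NET strings. The names CodePointScan, codePoints, isProhibited and containsProhibited are hypothetical, and the one-entry "prohibited" check is a placeholder, not a real stringprep table.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointScan {
    // Decode a UTF-16 string into Unicode code points, pairing surrogates by hand.
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // A well-formed surrogate pair: two code units, one code point.
                out.add(Character.toCodePoint(c, s.charAt(i + 1)));
                i += 2;
            } else {
                // A BMP character (including combining marks), or an
                // unpaired surrogate, which is passed through unchanged.
                out.add((int) c);
                i += 1;
            }
        }
        return out;
    }

    // Hypothetical profile check: only U+10FF1 is treated as prohibited here.
    static boolean isProhibited(int codePoint) {
        return codePoint == 0x10FF1;
    }

    // True if any individual code point in the string is prohibited.
    static boolean containsProhibited(String s) {
        for (int cp : codePoints(s)) {
            if (isProhibited(cp)) {
                return true;
            }
        }
        return false;
    }
}
```

With this approach the profile tables are consulted per code point, so a string holding the surrogate pair for U+10FF1 is rejected while a base letter followed by U+0301 and U+0302 is checked mark by mark. Grapheme grouping is never involved in the decision.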
