I've spent quite a bit of time over the last year trying to implement
RFC 3454 (Preparation of Internationalized Strings, aka 'stringprep').
This RFC is also a dependency of RFC 3491 (Nameprep, the stringprep
profile for Internationalized Domain Names / IDNA), which I also need
to support.

The problem that I've been struggling with in .NET is that of Unicode
code points > 0xFFFF. In UTF-16, which is what .NET strings use, these
code points are encoded as surrogate pairs, as defined in section 3.7
of the Unicode spec (http://www.unicode.org/book/ch03.pdf).
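
For reference, the surrogate pair arithmetic can be sketched as below. This is a minimal illustration in Python rather than .NET code, and the function name `to_surrogate_pair` is made up for the example; it assumes nothing beyond the UTF-16 rules for code points above U+FFFF.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10FF1)
print(f"U+10FF1 -> {high:04X} {low:04X}")  # U+10FF1 -> D803 DFF1

# UTF-8 encodes the same code point directly as four bytes, with no surrogates.
print("\U00010FF1".encode("utf-8").hex())  # f090bfb1
```

So the surrogate pair only exists at the UTF-16 level; the code point itself is the same whichever encoding carries it.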

Related to surrogate pairs is the whole set of Unicode combining
characters.

The problem, then, is this:
When I iterate over a string using the .NET StringInfo class I get a set
of graphemes. These graphemes correctly handle the combining characters
and surrogate pairs, and end up giving me a single UTF-32 Code Point for
each grapheme. 

BUT, let's say the original string had U+10FF1 encoded as a UTF-16
surrogate pair. This character is illegal in a particular stringprep
profile.

The original string also had a combining character sequence U+0301 +
U+0302 (for example), and the grapheme that the StringInfo class reports
for this is also U+10FF1.
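
For what it's worth, this can be sanity-checked outside .NET. The sketch below uses Python's standard `unicodedata` module, so it only approximates whatever StringInfo does internally. It assumes NFKC is the relevant normalization form, since that is the one stringprep specifies. In this check both marks are ordinary BMP code points with a non-zero combining class, and normalizing the pair yields the same two code points, nothing above U+FFFF.

```python
import unicodedata

marks = "\u0301\u0302"  # COMBINING ACUTE ACCENT + COMBINING CIRCUMFLEX ACCENT

# Each mark is its own code point with a non-zero canonical combining class.
for ch in marks:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} class={unicodedata.combining(ch)}")

# NFKC normalization of the two marks on their own.
normalized = unicodedata.normalize("NFKC", marks)
print([f"U+{ord(ch):04X}" for ch in normalized])
```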

The problem is that each of the combining characters IS legal in the
stringprep profile, but I have no way of telling if the original data
was the (illegal) UTF-32 code point, or the (legal) combining
characters. 
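
One approach that might avoid the ambiguity is to run the prohibited-character check per code point, before any grapheme grouping happens. Below is a minimal sketch of that idea in Python, not .NET. The lists of integers stand in for the UTF-16 code units of a .NET string. `PROHIBITED`, `code_points_from_utf16` and `find_prohibited` are hypothetical names for illustration, and the one-element set is a placeholder for a profile's real tables.

```python
def code_points_from_utf16(units):
    """Yield code points from UTF-16 code units, joining only surrogate pairs."""
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # A high surrogate followed by a low surrogate is one code point.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code point (or a lone surrogate, passed through as-is).
            yield unit
            i += 1


# Hypothetical prohibited set standing in for a profile's real tables.
PROHIBITED = {0x10FF1}


def find_prohibited(units, prohibited=PROHIBITED):
    """Return the prohibited code points found in a UTF-16 code unit sequence."""
    return [cp for cp in code_points_from_utf16(units) if cp in prohibited]


illegal_input = [0x0061, 0xD803, 0xDFF1]  # "a" + U+10FF1 as a surrogate pair
legal_input = [0x0061, 0x0301, 0x0302]    # "a" + two combining marks

print([hex(cp) for cp in find_prohibited(illegal_input)])  # ['0x10ff1']
print([hex(cp) for cp in find_prohibited(legal_input)])    # []
```

Because combining marks are never merged with their neighbours at this level, the two inputs stay distinguishable.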

Has anyone implemented any of this stuff in .NET?

-- 
Chris Mullins





===================================
This list is hosted by DevelopMentor®  http://www.develop.com
View archives and manage your subscription(s) at http://discuss.develop.com
