On Friday 14 January 2011 04:47:59 Steven Schveighoffer wrote:
> On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <[email protected]> wrote:
> > "Andrei Alexandrescu" <[email protected]> wrote in message
> > news:[email protected]...
> >
> >> On 1/13/11 10:26 PM, Nick Sabalausky wrote:
> >> [snip]
> >>
> >>> [ 'f', {u with the umlaut}, 'n', 'f' ]
> >>>
> >>> Or:
> >>>
> >>> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
> >>>
> >>> Those *both* get rendered exactly the same, and both represent the same
> >>> four-letter sequence. In the second example, the 'u' and the {umlaut
> >>> combining character} combine to form one grapheme. The f's and n's just
> >>> happen to be single-code-point graphemes.
> >>>
> >>> Note that while some characters exist in pre-combined form (such as the
> >>> {u with the umlaut} above), legend has it there are others that can
> >>> only be represented using a combining character.
> >>>
> >>> It's also my understanding, though I'm not certain, that sometimes
> >>> multiple combining characters can be used together on the same "root"
> >>> character.
> >>
> >> Thanks. One further question is: in the above example with
> >> u-with-umlaut, there is one code point that corresponds to the entire
> >> combination. Are there combinations that do not have a unique code
> >> point?
> >
> > My understanding is "yes". At least that's what I've heard, and I've
> > never heard any claims of "no". I don't know of any specific ones
> > offhand, though. Actually, it might be possible to use any combining
> > character with any old letter or number (like maybe a 7 with an umlaut),
> > though I'm not certain.
> >
> > FWIW, the Wikipedia article might help, or at least link to other things
> > that might help:
> > http://en.wikipedia.org/wiki/Combining_character
> > http://en.wikipedia.org/wiki/Unicode_normalization
>
> Linked from that page, the normalization process is probably something we
> need to look at.
> Using decomposed canonical form would mean we need more
> state than just which code unit we are on, plus it creates more likelihood
> that a match will be found with part of a grapheme (spir or Michel brought
> it up earlier). So I think the correct case is to use composed canonical
> form. This is after just reading that page, so maybe I'm missing
> something.
>
> Non-composable combinations would be a problem. The string range is
> formed on the basis that the element type is a dchar. If there are
> combinations that cannot be composed into a single dchar, then the element
> type has to be a dchar array (or some other type which contains all the
> info). The other option is to simply leave them decomposed. Then you
> risk things like partial matches.
>
> I'm leaning towards a solution like this: While iterating a string, it
> should output dchars in normalized composed form. But a specialized
> comparison function should be used when doing things like searches or
> regex, because it might not be possible to compose two combining
> characters.
>
> The drawback to this is that a dchar might not be able to represent a
> grapheme (only if it cannot be composed), but I think it's too much of a
> hit in complexity and performance to make the element type of a string
> larger than a dchar.

Well, there's plenty in std.string that already deals in strings rather than
dchar, and for the most part, any case where you couldn't fit a grapheme in a
dchar could be covered by using a string.

> Those who wish to work with a more comprehensive string type can use a
> more complex string type such as the one created by spir.
>
> Does that sound reasonable?

We really should have something along those lines it seems. From what little
_I_ know, the basic approach that you suggest seems like the correct one, but
perhaps someone more knowledgeable will be able to come up with a reason why
it's not a good idea. Certainly, I think that any solution that I'd come up
with would be similar to what you're suggesting.

- Jonathan M Davis
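
For anyone who wants to poke at the composed vs. decomposed question quoted
above, here is a minimal sketch. It is written in Python purely for
illustration, using the standard unicodedata module; it is not D and says
nothing about what Phobos does or should do. The variable names are made up
for the example. It shows the two spellings of the same four-letter word, the
partial-match risk with the decomposed spelling, and a combination (the "7
with an umlaut" case) that stays two code points even in composed form.

```python
import unicodedata

# Two spellings of the same four-letter word.
composed = "f\u00fcnf"     # u-with-umlaut as one precomposed code point (U+00FC)
decomposed = "fu\u0308nf"  # 'u' followed by U+0308 COMBINING DIAERESIS

# Rendered the same, but different code point sequences.
same_code_points = (composed == decomposed)   # False
lengths = (len(composed), len(decomposed))    # (4, 5)

# Partial match: a naive search for "fu" succeeds on the decomposed
# spelling, even though it cuts the grapheme 'u' + U+0308 in half.
naive_hit = "fu" in decomposed                # True

# Normalizing to composed canonical form (NFC) first avoids that here.
nfc = unicodedata.normalize("NFC", decomposed)
normalized_hit = "fu" in nfc                  # False

# Not every combination has a precomposed code point: '7' + U+0308 is
# still two code points after NFC, so one code point per grapheme
# cannot be guaranteed by normalization alone.
no_precomposed = unicodedata.normalize("NFC", "7\u0308")
```

The last case is why a search or comparison routine would still need to be
aware of combining characters even if iteration yields composed code points.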
