On Wed, Mar 10, 2010 at 1:25 PM, Michal Suchanek <[email protected]> wrote:

> What does the "char" type represent?

Yes. This is the essence of the question. And, as a consequence: what does a "char" *mean*?

> Unicode characters are potentially unbound in size, they can consist
> of a base character and a string of modifiers or may be some
> multicharacter ligatures or whatever. A character iterator is useful
> but a complete one returns a substring to handle any such case.

I still go back and forth on whether Unicode hopelessly boogered the entire notion of a character, or whether it merely revealed that the rest of us had the notion boogered all along. I suppose it doesn't matter, and I do feel that the Unicode position is more consistent than that of any predecessor I know about.

The essential point you are making, though, is that in the Unicode world there is no unit smaller than a string that can faithfully represent either a character or a glyph, and there isn't even a single standard representation for characters/glyphs (that is: there are multiple normalizations). And yet we seem to require some unit that will let us iterate over strings in a semantically sensible way.

My objection to "char as UCS-2" is that it isn't a semantically coherent choice. It is a choice that depends on the semantics of a particular encoding rather than the semantics of the payload.

So, to answer your question: I agree with your (implicit) observation that "char" is misnamed. My personal opinion is that the unit we want here is "code point". But having said that, perhaps you mean to suggest that "code point" is neither more nor less broken than "code unit" because of modifiers. If that is true, there is no real reason to fight very hard about it. Is that what you mean to suggest?

shap

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
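[Editor's note: the two problems discussed above, namely that a user-visible character has no single code point sequence without normalization, and that "char as UCS-2 code unit" breaks outside the Basic Multilingual Plane, can be demonstrated concretely. This is a small Python sketch using only the standard `unicodedata` module; it is an illustration, not anything from the BitC implementation.]

```python
import unicodedata

# One "character" to a reader, two different code point sequences:
nfc = "\u00e9"    # precomposed: LATIN SMALL LETTER E WITH ACUTE
nfd = "e\u0301"   # decomposed: "e" + COMBINING ACUTE ACCENT

print(len(nfc), len(nfd))   # 1 2  -- the code point counts differ
print(nfc == nfd)           # False -- no canonical form without normalizing
print(unicodedata.normalize("NFC", nfd) == nfc)   # True after normalization

# "char as UTF-16 code unit" breaks outside the BMP:
clef = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, code point U+1D11E
print(len(clef))                            # 1 code point...
print(len(clef.encode("utf-16-le")) // 2)   # ...but 2 code units (a surrogate pair)
```

Iterating by code unit would split the surrogate pair in half; iterating by code point keeps the clef whole but still splits "e" from its combining accent, which is exactly the modifier objection raised above.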
