On Wed, Mar 10, 2010 at 1:25 PM, Michal Suchanek <[email protected]> wrote:
> What does the "char" type represent?

Yes. This is the essence of the question. And as a consequence: what
does a "char" *mean*?

> Unicode characters are potentially unbounded in size: they can consist
> of a base character and a string of modifiers, or maybe some
> multi-character ligatures, or whatever. A character iterator is useful,
> but a complete one must return a substring to handle any such case.

I still go back and forth on whether Unicode hopelessly boogered the
entire notion of a character, or whether it merely served to reveal
that the rest of us had the notion boogered all along. I suppose it
doesn't matter, and I do feel that the Unicode position is more
consistent than any predecessor I know about.

The essential point you are making, though, is that there is no unit
smaller than a string that can faithfully represent either a character
or a glyph in the Unicode world, and there isn't even a single standard
representation for characters/glyphs (that is: there are multiple
normalizations).
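To make that concrete, here is a small sketch (Python, used here purely for illustration) showing that one user-perceived character can be spelled as different code point sequences, related only by normalization:

```python
import unicodedata

# 'é' has (at least) two canonically equivalent spellings:
nfc = unicodedata.normalize("NFC", "e\u0301")  # precomposed: one code point
nfd = unicodedata.normalize("NFD", "\u00e9")   # decomposed: 'e' + combining acute

assert nfc == "\u00e9" and len(nfc) == 1
assert nfd == "e\u0301" and len(nfd) == 2
assert nfc != nfd                               # different code point sequences,
assert unicodedata.normalize("NFC", nfd) == nfc # yet canonically equivalent
```

So even "one character" is only well-defined up to a choice of normal form.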

And yet, we seem to require some unit that will let us iterate over
strings in a semantically sensible way. My objection to "char as UCS2"
is that it isn't a semantically coherent choice. It's a choice that
depends on the semantics of a particular encoding rather than the
semantics of the payload.
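The code-unit/code-point distinction is easy to demonstrate (again a Python sketch, not anything BitC-specific): a character outside the BMP is one code point but two UTF-16 code units, which is exactly where "char as UCS2" falls apart.

```python
s = "a\U0001D11E"  # 'a' followed by MUSICAL SYMBOL G CLEF (U+1D11E)

code_points = [ord(c) for c in s]              # Python iterates by code point
utf16_units = len(s.encode("utf-16-le")) // 2  # each UTF-16 code unit is 2 bytes

assert len(code_points) == 2  # two code points...
assert utf16_units == 3       # ...but three code units (U+1D11E is a surrogate pair)
```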

So to answer your question, I agree with your (implicit) observation
that "char" is misnamed. My personal opinion is that the unit we want
here is "code point".

But having said that, perhaps you mean to be suggesting that "code
point" is neither more nor less broken than "code unit" because of
modifiers. If this is true, there is no real reason to fight very hard
about it. Is that what you mean to suggest?
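For what it's worth, the "neither more nor less broken" reading is easy to exhibit (Python again, for illustration only): iterating by code point still splits a user-perceived character once combining modifiers are involved.

```python
s = "e\u0301"     # one visible character: 'e' + COMBINING ACUTE ACCENT
points = list(s)  # code-point iteration

assert points == ["e", "\u0301"]  # two code points for one perceived character
```

So code points fix the surrogate problem, but not the grapheme problem.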

shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev