On 10 March 2010 22:35, Jonathan S. Shapiro <[email protected]> wrote:
> On Wed, Mar 10, 2010 at 1:25 PM, Michal Suchanek <[email protected]> wrote:
>> What does the "char" type represent?
>
> Yes. This is the essence of the question. And as a consequence: what
> does a "char" *mean*?
>
>> Unicode characters are potentially unbounded in size: they can
>> consist of a base character and a string of modifiers, or may be
>> multi-character ligatures, or whatever. A character iterator is
>> useful, but a complete one returns a substring to handle any such case.
>
> I still go back and forth on whether Unicode hopelessly boogered the
> entire notion of a character, or whether it merely served to reveal
> that the rest of us had the notion boogered all along. I suppose it
> doesn't matter, and I do feel that the Unicode position is more
> consistent than any predecessor I know about.
>
> The essential point you are making, though, is that there is no unit
> smaller than a string that can faithfully represent either a character
> or a glyph in the Unicode world, and there isn't even a standard
> representation for characters/glyphs (that is: there are multiple
> normalizations).
>
> And yet, we seem to require some unit that will let us iterate over
> strings in a semantically sensible way. My objection to "char as UCS-2"
> is that it isn't a semantically coherent choice. It's a choice that
> depends on the semantics of a particular encoding rather than the
> semantics of the payload.

The correct iteration unit depends heavily on the application and in
part on the internal representation. In the general case, most
applications require units of substrings, whether those are characters
of some sort, graphemes, words, lines, and so on. Only write() or an
equivalent byte-oriented interface can meaningfully iterate over bytes.
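
As a concrete illustration, here is a Rust sketch (the
unicode-segmentation crate is my assumption, nothing BitC defines) of
how the same two-code-point string divides into different units:

    // Requires the external unicode-segmentation crate for grapheme
    // clusters; bytes() and chars() come from the standard library.
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "e\u{0301}"; // 'e' + COMBINING ACUTE ACCENT, renders as "é"
        println!("{}", s.bytes().count());         // 3: UTF-8 bytes
        println!("{}", s.chars().count());         // 2: code points
        println!("{}", s.graphemes(true).count()); // 1: grapheme cluster
    }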

>
> So to answer your question, I agree with your (implicit) observation
> that "char" is misnamed. My personal opinion is that the unit we want
> here is "code point".

What would it be used for? If there are interfaces working in code
points, they can just pass them around as integers; they are binary
data like any other integers.

If you are bothered by mixing them with other integers, just allow
general tagged integers:

CodePoint: int32
BEint32: int32
LEint32: int32

BEint32 x
LEint32 y

x = y  // Error: the tags do not match
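
A minimal Rust sketch of the same idea using newtype wrappers (the
names are mine; none of this is BitC syntax):

    // Same underlying representation, distinct nominal types, so the
    // compiler rejects accidental mixing.
    struct CodePoint(u32);
    struct BEInt32(i32);
    struct LEInt32(i32);

    fn main() {
        let _cp = CodePoint(0x0301);
        let x = BEInt32(42);
        // let y: LEInt32 = x;  // rejected at compile time: mismatched types
        let BEInt32(raw) = x;   // unwrapping is always explicit
        println!("{}", raw);
    }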


>
> But having said that, perhaps you mean to be suggesting that "code
> point" is neither more nor less broken than "code unit" because of
> modifiers. If this is true, there is no real reason to fight very hard
> about it. Is that what you mean to suggest?

Yes: both are broken, and neither is useful in most interfaces.
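
For example (a Rust sketch), two strings a user would consider
identical compare unequal code point by code point, because of the
multiple normalizations mentioned above:

    fn main() {
        let precomposed = "\u{00E9}";  // 'é' as a single code point (NFC)
        let decomposed  = "e\u{0301}"; // 'e' + combining acute (NFD)
        assert_ne!(precomposed, decomposed); // renders the same, compares unequal
        assert_eq!(precomposed.chars().count(), 1);
        assert_eq!(decomposed.chars().count(), 2);
    }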

On 10 March 2010 22:39, Jonathan S. Shapiro <[email protected]> wrote:
> On Wed, Mar 10, 2010 at 1:28 PM, Sandro Magi <[email protected]> wrote:
>> Just use a full word, and the string representation can use a more
>> packed form to avoid waste.
>
> This is the position that I think sounds right, but I'm not a Unicode
> expert. Let's play with it for a moment.
>
> So you are saying that BitC "char" is 32-bit. That is: a CodePoint.
> Assuming this, when we bring in a data structure from C# having a
> field of type char, how does that type appear in BitC? As a working
> name, let me call it "CodeUnit". Now what relationship, if any, exists
> between CodeUnit and BitC char?

None in general.

There may be application-specific relationships.

Assuming CodeUnit is 16-bit in the CLR, it's just an int16 like any
other integer. It may just happen to contain something that looks like
a character, or it may contain garbage. Note the C getc interface,
which returns a (former) ASCII character as an int :->

Sane interfaces won't use it. For insane interfaces, typing it as int16
makes it clear what it actually is.
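
To make the "garbage" case concrete, a Rust sketch: encode a character
outside the Basic Multilingual Plane as UTF-16, and each resulting
16-bit code unit is a lone surrogate that is not a valid character by
itself:

    fn main() {
        // U+1F600 is outside the BMP, so UTF-16 needs two code units.
        let units: Vec<u16> = "\u{1F600}".encode_utf16().collect();
        assert_eq!(units, [0xD83D, 0xDE00]); // a surrogate pair
        // Neither half is a valid code point on its own:
        assert!(char::from_u32(units[0] as u32).is_none());
    }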

Thanks

Michal
