On Wed, Mar 10, 2010 at 2:21 PM, Michal Suchanek <[email protected]> wrote:
> The correct iteration unit depends heavily on the application and in
> part on the internal representation.

Yes. And this makes it very unfortunate to remain silent on the choice
of internal representation.

> In the general case most applications require units of substrings, be
> that some sort of characters, graphemes, words, lines, ...
> Only write() or equivalent can meaningfully iterate over bytes.

Normalization validators can as well. And if "char" is UCS2 or UCS4, it
is necessary for string-building mechanisms to know the rules of
composition.
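
As a concrete illustration (a minimal Haskell sketch, since Haskell
comes up again below; isCombining is an illustrative name, not anything
in BitC), the least a composition-aware string builder must know is
which code points are combining marks:

   import Data.Char (generalCategory, GeneralCategory(..))

   -- A combining mark attaches to the preceding base character, so a
   -- builder must not, e.g., begin a fresh string with one.
   isCombining :: Char -> Bool
   isCombining c = generalCategory c `elem`
     [NonSpacingMark, SpacingCombiningMark, EnclosingMark]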

>> So to answer your question, I agree with your (implicit) observation
>> that "char" is misnamed. My personal opinion is that the unit we want
>> here is "code point".
>
> What would it be used for? If there are interfaces working in
> codepoints they can just pass them around as integers; they are binary
> data like any other integers.

In the abstract, yes. As a practical matter, they have different
semantics from integers. Following Haskell's newtype idiom, it would
make sense for BitC to do something like:

   (def BitC.char (NewType int32))
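
For comparison, the Haskell spelling of the same idea would be a
newtype (a sketch; CodePoint is an illustrative name):

   import Data.Int (Int32)

   -- Same machine representation as Int32, but a distinct type, so a
   -- code point cannot silently be confused with arbitrary integers.
   newtype CodePoint = CodePoint Int32 deriving (Eq, Ord, Show)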

>> But having said that, perhaps you mean to be suggesting that "code
>> point" is neither more nor less broken than "code unit" because of
>> modifiers. If this is true, there is no real reason to fight very hard
>> about it. Is that what you mean to suggest?
>
> Yes, both are broken and not useful in most interfaces.

The more I think about this, the more I tend to agree, but there is a
practical problem with this position: it means that simple things like
sorting and searching become very heap-intensive, because every
comparison has to materialize substring (e.g. grapheme) objects. Most
sorting/searching problems on characters over a string of known
normalization can (and should) be reduced to sorting/searching over
code points or code units for reasons of efficiency.
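
To make the reduction concrete, a minimal Haskell sketch (the function
names are illustrative): once two strings are known to share a
normalization form, ordering them is just a code-point comparison, with
no per-comparison segmentation or allocation:

   import Data.Char (ord)
   import Data.List (sortBy)
   import Data.Ord (comparing)

   -- Valid only when both strings are in the same normalization form;
   -- then lexicographic order over code points is well-defined.
   codePointCompare :: String -> String -> Ordering
   codePointCompare = comparing (map ord)

   sortNormalized :: [String] -> [String]
   sortNormalized = sortBy codePointCompare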

>> Now what relationship, if any, exists between CodeUnit and BitC char?
>
> None in general....
> Sane interfaces won't use it. For insane interfaces, typing it as
> int16 makes it clear what it is.

By this definition, there are a great many not-sane interfaces
pre-existing in the CLR and JVM.

Broadly, I think I agree with you, except that we should not confuse
matters further by overloading int16/int32 even more. UCS2/UCS4 may
share their representations with int16/int32, but the descriptive
distinction seems useful to me.
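
In the same spirit as the newtype above (again a sketch, with
illustrative names), the two could be kept descriptively distinct while
sharing the integer representations:

   import Data.Word (Word16, Word32)

   -- Same bits as int16/int32, but the types document intent and stop
   -- accidental mixing of code units with ordinary integers.
   newtype UCS2 = UCS2 Word16 deriving (Eq, Ord, Show)
   newtype UCS4 = UCS4 Word32 deriving (Eq, Ord, Show)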


shap