Re: [bitc-dev] BitC 0.20: Unicode

Michal Suchanek Wed, 10 Mar 2010 16:08:31 -0800

On 11 March 2010 00:05, Jonathan S. Shapiro <[email protected]> wrote:
> On Wed, Mar 10, 2010 at 2:21 PM, Michal Suchanek <[email protected]> wrote:
>> The correct iteration unit depends heavily on the application and in
>> part on the internal representation.
>
> Yes. And this makes it very unfortunate to be silent on the choice of
> internal representation.
>
>> In the general case most applications require units of substrings, be
>> that some sort of characters, graphemes, words, lines, ..
>> Only write() or equivalent can meaningfully iterate over bytes.
>
> Also normalization validators, and if "char" is UCS2 or UCS4 it is
> necessary for string building mechanisms to know the rules of
> composition.
>
>>> So to answer your question, I agree with your (implicit) observation
>>> that "char" is misnamed. My personal opinion is that the unit we want
>>> here is "code point".
>>
>> What would it be used for? If there are interfaces working in
>> codepoints they can just pass them around as integers, they are binary
>> data like any other integers.
>
> In abstract, yes. As a practical matter, they have different semantics
> from integers. In Haskell it would make sense to do something like:
>
>   (def BitC.char (NewType int32))
>
>>> But having said that, perhaps you mean to be suggesting that "code
>>> point" is neither more nor less broken than "code unit" because of
>>> modifiers. If this is true, there is no real reason to fight very hard
>>> about it. Is that what you mean to suggest?
>>
>> Yes, both are broken and not useful in most interfaces.
>
> The more I think about this, the more I tend to agree, but there is a
> practical problem with this position: it means that simple things like
> sorting and searching become very heap-intensive. Most
> sorting/searching problems on characters over a string having known
> normalization can (and should) be reduced to sorting/searching over
> code points or code units for reasons of efficiency.


That's a matter of differentiating between applications. Sometimes you
want the sorting to be fast and give *some* order which is only
required to be always the same, sometimes you want a particular order
which has to be determined by tediously consulting multiple tables,
depends on language, time of the day, the shoe size of the user, and
whatnot.

The problem with quick sorting is that if you have an fast-ASCII-sort
it breaks the moment you put a single complex string into the sorted
set. It still gives a consistent order in the sense it does not change
between invocations but it is no longer what the user expects of
string sort. Still it is usable for doing sorted array bisection or
the like.

An optimized sort routine need not actually slice the strings or may
work with a constant number of slices which are updated to point to
different parts of different strings, even in the complex case.

Thanks

Michal
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] BitC 0.20: Unicode

Reply via email to