On Wed, Mar 10, 2010 at 2:21 PM, Michal Suchanek <[email protected]> wrote:

> The correct iteration unit depends heavily on the application and in
> part on the internal representation.

Yes. And this makes it very unfortunate to be silent on the choice of
internal representation.

> In the general case most applications require units of substrings, be
> that some sort of characters, graphemes, words, lines, ..
> Only write() or equivalent can meaningfully iterate over bytes.

Also normalization validators. And if "char" is UCS2 or UCS4, it is
necessary for string-building mechanisms to know the rules of
composition.

>> So to answer your question, I agree with your (implicit) observation
>> that "char" is misnamed. My personal opinion is that the unit we want
>> here is "code point".
>
> What would it be used for? If there are interfaces working in
> codepoints they can just pass them around as integers, they are binary
> data like any other integers.

In the abstract, yes. As a practical matter, they have different
semantics from integers. In Haskell terms, it would make sense to do
something like:

  (def BitC.char (NewType int32))
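For concreteness, the same idea in actual Haskell might look roughly
like the sketch below. The CodePoint name and the scalar-value check
are my own illustration, not part of any proposed BitC interface:

  import Data.Int (Int32)

  -- Representationally just an integer, but with distinct semantics:
  -- not every Int32 is a valid code point, and arithmetic on code
  -- points is rarely meaningful.
  newtype CodePoint = CodePoint Int32
    deriving (Eq, Ord, Show)

  -- Smart constructor: accept only Unicode scalar values, i.e.
  -- 0..0x10FFFF excluding the surrogate range 0xD800..0xDFFF.
  mkCodePoint :: Int32 -> Maybe CodePoint
  mkCodePoint n
    | n < 0 || n > 0x10FFFF      = Nothing
    | 0xD800 <= n && n <= 0xDFFF = Nothing
    | otherwise                  = Just (CodePoint n)

An interface that really does traffic in code points can then say so
in its type, while still compiling down to a plain int32.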
>> But having said that, perhaps you mean to be suggesting that "code
>> point" is neither more nor less broken than "code unit" because of
>> modifiers. If this is true, there is no real reason to fight very
>> hard about it. Is that what you mean to suggest?
>
> Yes, both are broken and not useful in most interfaces.

The more I think about this, the more I tend to agree, but there is a
practical problem with this position: it means that simple things like
sorting and searching become very heap-intensive. Most
sorting/searching problems on characters over a string whose
normalization is known can (and should) be reduced, for reasons of
efficiency, to sorting/searching over code points or code units; a
sketch follows below.

>> Now what relationship, if any, exists between CodeUnit and BitC
>> char?
>
> None in general....
> Sane interfaces won't use it. For insane interfaces typing it as
> int16 makes it clear what it is.

By this definition, a great many not-sane interfaces already exist in
the CLR and the JVM. Broadly, I think I agree with you, except that we
should not confuse matters further by overloading int16/int32 even
more. UCS2 code units and UCS4 code points may share representations
with int16 and int32, but the descriptive distinction seems useful to
me.
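To make both points concrete (the reduction to code points/units for
sorting and searching, and the descriptive type distinction), here is
a small Haskell sketch. The type names and compareNormalized are made
up for illustration, and it assumes both inputs are already in the
same known normalization form:

  import Data.Int  (Int32)
  import Data.Word (Word16)

  -- Descriptively distinct types, even though the representations
  -- are just 16- and 32-bit integers.
  newtype Ucs2CodeUnit  = Ucs2CodeUnit  Word16 deriving (Eq, Ord, Show)
  newtype Ucs4CodePoint = Ucs4CodePoint Int32  deriving (Eq, Ord, Show)

  -- When both strings are known to be in the same normalization form,
  -- canonically equivalent strings have identical code point
  -- sequences, so a flat lexicographic comparison suffices for
  -- equality tests and searching, and it yields a consistent (binary,
  -- not locale-aware) ordering without any per-grapheme heap
  -- structures.
  compareNormalized :: [Ucs4CodePoint] -> [Ucs4CodePoint] -> Ordering
  compareNormalized = compare

A real implementation would compare packed arrays of code units rather
than lists; the list form is only to keep the sketch short.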
shap