On Mar 9, 2010, at 16:13, Jonathan S. Shapiro wrote:

> [Re-send - original sent to wrong alias]
>
> One of the mundane issues I want to take up is character and string
> encoding. The issue that is driving this is JVM/CLR, neither of which
> properly implements Unicode. That is: the "character" type in both
> runtimes is 16 bits, and this can only encode the Basic Multilingual
> Plane.
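To make that concrete, a quick Java sketch (untested; U+1D11E is just an
arbitrary supplementary-plane character):

    public class BmpDemo {
        public static void main(String[] args) {
            // A character outside the BMP does not fit in one 16-bit
            // char; inside a String it is stored as a surrogate pair,
            // i.e. two UTF-16 code units.
            String s = new String(Character.toChars(0x1D11E));
            System.out.println(s.length());                      // 2 code units
            System.out.println(s.codePointCount(0, s.length())); // 1 code point
        }
    }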
Actually, Java is nominally UTF-16:
http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#95413

(This is the language spec, not the VM spec, yes; but interop with the
Java standard libraries is presumably the primary concern in choosing how
a JVM::char is interpreted...)

> [...]
> In CLR (and I believe in JVM), characters outside the BMP can be (and
> should be, for interop) encoded in strings using surrogate pairs, and
> with (considerable) care these can be processed. So long as we reject
> strings that contain malformed surrogate pairs, we should be fine. In
> practice this means:
>
> Strings returned from CLR/JVM routines need to be validated.
>
> Substring operations must fail if they would result in a broken
> surrogate pair. The easiest way to handle this is to define them in
> terms of characters rather than code units.

It is possible, if weird, to handle such things as uneven widths and
invalid substring indexes by defining the high-level interfaces such that
*numeric* indexes are never seen by most programmers; see Taylor
Campbell's Scheme work on this idea. It seems reasonable to me, but I
haven't actually done any work within the system. The starting premise,
as I recall it, is essentially that even if we always work in 32-bit
units, that isn't what user-programmers actually want; consider combining
characters. Rather, the primitives should iterate over strings in
selectable units (grapheme cluster, scalar value, UTF-N code unit,
whatever) and parse.

> If we choose BitC "char" to be 16 bits, then I propose to add a new
> character type UniChar that covers the full Unicode set.
> Alternatively, we could define "char" as the 32-bit unit, and
> introduce BMPChar or CodeUnit for interaction with JVM/CLR.

I suggest UTF16Char for maximum obviousness.
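Whatever the type is named, the validation and substring rules quoted
above are mechanical. Roughly this, as a Java sketch (untested, and the
class and method names here are mine, not any existing API):

    final class SurrogateChecks {
        // Reject strings containing unpaired (malformed) surrogates.
        static boolean isWellFormed(String s) {
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (Character.isHighSurrogate(c)) {
                    // A high surrogate must be followed by a low one.
                    if (i + 1 >= s.length()
                        || !Character.isLowSurrogate(s.charAt(i + 1)))
                        return false;
                    i++; // skip the low half of the valid pair
                } else if (Character.isLowSurrogate(c)) {
                    return false; // low half with no preceding high half
                }
            }
            return true;
        }

        // Substring over code-unit indices that fails rather than
        // produce a broken surrogate pair.
        static String checkedSubstring(String s, int begin, int end) {
            if (splitsPair(s, begin) || splitsPair(s, end))
                throw new IllegalArgumentException(
                    "substring would split a surrogate pair");
            return s.substring(begin, end);
        }

        // True when cutting at index i would separate a surrogate pair.
        static boolean splitsPair(String s, int i) {
            return i > 0 && i < s.length()
                && Character.isHighSurrogate(s.charAt(i - 1))
                && Character.isLowSurrogate(s.charAt(i));
        }
    }

Note that defining substring in terms of characters, as proposed in the
quoted text, makes the splitsPair check unnecessary: character boundaries
can never fall inside a pair.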
-- 
Kevin Reid <http://switchb.org/kpreid/>