On Mar 9, 2010, at 16:13, Jonathan S. Shapiro wrote:

> [Re-send - original sent to wrong alias]
>
> One of the mundane issues I want to take up is character and string  
> encoding. The issue that is driving this is JVM/CLR, neither of  
> which properly implements unicode. That is: the "character" type in  
> both runtimes is 16 bit, and this can only encode the Basic  
> Multilingual Plane.

Actually, Java is nominally UTF-16:

http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#95413

(This is the language spec, not the VM spec, yes; but since interop  
with the Java standard libraries is presumably the primary concern in  
choosing how a JVM::char is interpreted...)

> [...]
> In CLR (and I believe in JVM), characters outside the BMP can be

(and should be, for interop)

> encoded in strings using surrogate pairs, and with (considerable)  
> care these can be processed. So long as we reject strings that  
> contain malformed surrogate pairs, we should be fine. In practice  
> this means:
> Strings returned from CLR/JVM routines need to be validated.
> Substring operations must fail if they would result in a broken  
> surrogate pair. The easiest way to handle this is to define them in  
> terms of characters rather than code units.
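
The quoted rules -- reject strings with malformed surrogate pairs, and
refuse substring indexes that would split a pair -- could be sketched in
Java roughly as follows. This is my own illustration, not code from the
thread; the class and method names are made up:

```java
// Sketch of the validation rules quoted above: a UTF-16 string is
// well-formed iff every high surrogate is immediately followed by a
// low surrogate, and no low surrogate appears on its own.
public final class Utf16Check {
    /** Returns true iff s contains no lone (unpaired) surrogate code units. */
    public static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be immediately followed by a low one.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate with no preceding high surrogate is malformed.
                return false;
            }
        }
        return true;
    }

    /** Returns true iff index falls between the two halves of a surrogate pair,
     *  i.e. a substring cut there would break the pair. */
    public static boolean splitsPair(String s, int index) {
        return index > 0 && index < s.length()
            && Character.isHighSurrogate(s.charAt(index - 1))
            && Character.isLowSurrogate(s.charAt(index));
    }
}
```

A substring operation defined in terms of characters rather than code
units would simply never produce an index for which splitsPair is true.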

It is possible (if unusual) to handle problems like uneven code-unit  
widths and invalid substring indexes by defining the high-level  
interfaces so that *numeric* indexes are never exposed to most  
programmers; see Taylor Campbell's Scheme work on this idea. It seems  
reasonable to me, but I haven't actually done any work within the  
system.

The starting premise, as I recall it, is essentially that even if we  
always work in 32-bit units, that isn't what user-programmers actually  
want -- consider combining characters. Rather, the primitives should  
iterate over strings in selectable units (grapheme cluster, scalar  
value, UTF-N code unit, whatever) and parse.
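
To make the "selectable units" point concrete, here is how one string
decomposes at three granularities in Java today -- code units via
String.length, scalar values via String.codePoints, and (approximately)
grapheme clusters via java.text.BreakIterator. The class and method
names below are my own, for illustration only:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// One string, three granularities: UTF-16 code units, Unicode scalar
// values, and user-perceived characters (grapheme clusters).
public final class StringUnits {
    /** Unicode scalar values (code points), one int each. */
    public static List<Integer> scalarValues(String s) {
        List<Integer> out = new ArrayList<>();
        s.codePoints().forEach(out::add);
        return out;
    }

    /** Grapheme clusters, using the platform's character break rules
     *  (base characters grouped with their combining marks). */
    public static List<String> graphemes(String s) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        for (int start = it.first(), end = it.next();
             end != BreakIterator.DONE;
             start = end, end = it.next()) {
            out.add(s.substring(start, end));
        }
        return out;
    }
}
```

For "e" followed by U+0301 (combining acute accent): two code units,
two scalar values, but one grapheme cluster -- which is the unit most
user-programmers actually mean by "character".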

> If we choose BitC "char" to be 16-bits, then I propose to add a new  
> character type UniChar that covers the full unicode set.  
> Alternatively, we could define "char" as the 32-bit unit, and  
> introduce BMPChar or CodeUnit for interaction with JVM/CLR.

I suggest UTF16Char for maximum obviousness.

-- 
Kevin Reid                                  <http://switchb.org/kpreid/>

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
