Hi Todd,

> On Dec 11, 2015, at 12:57 PM, Todd Blanchard <[email protected]> wrote:
>
>
>> On Dec 11, 2015, at 12:19, EuanM <[email protected]> wrote:
>>
>> "If it hasn't already been said, please do not conflate Unicode and
>> UTF-8. I think that would be a recipe for
>> a high P.I.T.A. factor." --Richard Sargent
>
> Well, yes. But I think you guys are making this way too hard.
>
> A Unicode character is an abstract idea - for instance the letter 'a'.
> The letter 'a' has a code point - it's the number 97. How the number 97 is
> represented in the computer is irrelevant.
>
> Now we get to transfer encodings. These are UTF8, UTF16, etc. A transfer
> encoding specifies the binary representation of the sequence of code points.
>
> UTF8 is a variable-length byte encoding. You read it one byte at a time,
> aggregating byte sequences into 'code points'. ByteArray would be an excellent
> choice as a superclass, but it must be understood that #at: or #at:put: refers
> to a byte, not a character. If you want characters, you have to start at the
> beginning and process it sequentially, like a stream (if working in the ASCII
> domain - you can generally 'cheat' this a bit). A C representation would be
> char utf8[];
>
> UTF16 is also a variable-length encoding of two-byte quantities - what C used
> to call a 'short int'. You process it in two-byte chunks instead of one-byte
> chunks. Like UTF8, you must read it sequentially to interpret the
> characters. #at: and #at:put: would necessarily refer to byte pairs and not
> characters. A C representation would be short utf16[]; It would also be 50%
> space inefficient for ASCII - which is normally the bulk of your text.
>
> Realistically, you need exactly one in-memory format and stream
> readers/writers that can convert (these are typically table-driven state
> machines). My choice would be UTF8 for the internal memory format and the
> ability to read and write from UTF8 to UTF16.
>
> But I stress again...strings don't really need indexability as much as you
> think, and neither UTF8 nor UTF16 provides this property anyhow, as they are
> variable-length encodings. I don't see any sensible reason to have more than
> one in-memory binary format in the image.
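[Editorial aside: Todd's point about reading UTF-8 "one byte at a time, aggregating byte sequences to code points" can be sketched in C. This is a minimal illustration, not production code - it trusts the lead byte and omits validation of continuation bytes and overlong forms:]

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s, store the code point in *cp,
 * and return the number of bytes consumed (0 on a malformed lead byte).
 * The lead byte's high bits tell you how many continuation bytes follow,
 * which is exactly why #at: on the underlying bytes is not "character at:". */
size_t utf8_decode(const uint8_t *s, uint32_t *cp) {
    if (s[0] < 0x80) {                          /* 0xxxxxxx: one byte */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {                /* 110xxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {                /* 1110xxxx 10xxxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0) {                /* 11110xxx + 3 continuations */
        *cp = ((uint32_t)(s[0] & 0x07) << 18)
            | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6)
            | (s[3] & 0x3F);
        return 4;
    }
    return 0;                                    /* malformed lead byte */
}
```

[Since each call consumes a variable number of bytes, finding the Nth character means running this decoder from the start of the buffer - the stream-like access Todd describes.]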
The only reasons are space and time. If a string only contains code
points in the range 0-255 there's no point in squandering 4 bytes per
code point (same goes for 0-65535). Further, if in some application
interchange is more important than random access it may make sense on
performance grounds to use UTF-8 directly. Again, Smalltalk's dynamic
typing makes it easy to have one's cake and eat it too.

> My $0.02c

_,,,^..^,,,_ (phone)

>
>> I agree. :-)
>>
>> Regarding UTF-16, I just want to be able to export to, and receive
>> from, Windows (and any other platforms using UTF-16 as their native
>> character representation).
>>
>> Windows will always be able to accept UTF-16. All Windows apps *might
>> well* export UTF-16. There may be other platforms which use UTF-16 as
>> their native format. I'd just like to be able to cope with those
>> situations. Nothing more.
>>
>> All this requires is a Utf16String class that has an asUtf8String
>> method (and any other required conversion methods).
>
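[Editorial aside: the conversion an asUtf8String method would have to perform can be sketched in C. This is a hypothetical illustration of the underlying transcoding, not any actual Smalltalk implementation - it assumes host-byte-order UTF-16 input and omits error handling for unpaired surrogates:]

```c
#include <stddef.h>
#include <stdint.h>

/* Convert n UTF-16 code units (host byte order) to UTF-8 bytes in out,
 * combining surrogate pairs into supplementary-plane code points.
 * Returns the number of UTF-8 bytes written. */
size_t utf16_to_utf8(const uint16_t *in, size_t n, uint8_t *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        uint32_t cp;
        if (in[i] >= 0xD800 && in[i] <= 0xDBFF && i + 1 < n) {
            /* High surrogate + low surrogate -> code point >= U+10000 */
            cp = 0x10000 + (((uint32_t)(in[i] - 0xD800) << 10)
                            | (in[i + 1] - 0xDC00));
            i += 2;
        } else {
            cp = in[i];
            i += 1;
        }
        if (cp < 0x80) {                          /* 1-byte sequence */
            out[o++] = (uint8_t)cp;
        } else if (cp < 0x800) {                  /* 2-byte sequence */
            out[o++] = (uint8_t)(0xC0 | (cp >> 6));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                /* 3-byte sequence */
            out[o++] = (uint8_t)(0xE0 | (cp >> 12));
            out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else {                                  /* 4-byte sequence */
            out[o++] = (uint8_t)(0xF0 | (cp >> 18));
            out[o++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

[Note the ASCII case: every code point below 128 costs one byte in UTF-8 but two in UTF-16, which is the 50% overhead Todd mentions.]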
