Hi Todd,

> On Dec 11, 2015, at 12:57 PM, Todd Blanchard <[email protected]> wrote:
>
>
>> On Dec 11, 2015, at 12:19, EuanM <[email protected]> wrote:
>>
>> "If it hasn't already been said, please do not conflate Unicode and
>> UTF-8. I think that would be a recipe for
>> a high P.I.T.A. factor." --Richard Sargent
>
> Well, yes. But I think you guys are making this way too hard.
>
> A Unicode character is an abstract idea - for instance the letter 'a'.
> The letter 'a' has a code point - it's the number 97. How the number 97 is
> represented in the computer is irrelevant.
>
> Now we get to transfer encodings. These are UTF8, UTF16, etc. A transfer
> encoding specifies the binary representation of the sequence of code points.
>
> UTF8 is a variable-length byte encoding. You read it one byte at a time,
> aggregating byte sequences into 'code points'. ByteArray would be an excellent
> choice as a superclass, but it must be understood that #at: or #at:put: refers
> to a byte, not a character. If you want characters, you have to start at the
> beginning and process it sequentially, like a stream (if working in the ASCII
> domain - you can generally 'cheat' this a bit). A C representation would be
> char utf8[];
>
> UTF16 is also a variable-length encoding of two-byte quantities - what C used
> to call a 'short int'. You process it in two-byte chunks instead of one-byte
> chunks. Like UTF8, you must read it sequentially to interpret the
> characters. #at: and #at:put: would necessarily refer to byte pairs and not
> characters. A C representation would be short utf16[]; It would also be 50%
> space inefficient for ASCII - which is normally the bulk of your text.
>
> Realistically, you need exactly one in-memory format and stream
> readers/writers that can convert (these are typically table-driven state
> machines). My choice would be UTF8 for the internal memory format and the
> ability to read and write from UTF8 to UTF16.
>
> But I stress again...strings don't really need indexability as much as you
> think, and neither UTF8 nor UTF16 provides this property anyhow, as they are
> variable-length encodings. I don't see any sensible reason to have more than
> one in-memory binary format in the image.
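[Editorial aside: Todd's point about reading UTF-8 "one byte at a time, aggregating byte sequences to code points" can be sketched in C. This is a minimal illustration, not production code - it trusts the lead byte and omits validation of continuation bytes and overlong forms:]

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s, store the code point in *cp,
 * and return the number of bytes consumed (0 on a malformed lead byte).
 * The lead byte's high bits tell you how many continuation bytes follow,
 * which is exactly why #at: on the underlying bytes is not "character at:". */
size_t utf8_decode(const uint8_t *s, uint32_t *cp) {
    if (s[0] < 0x80) {                          /* 0xxxxxxx: one byte */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {                /* 110xxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {                /* 1110xxxx 10xxxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0) {                /* 11110xxx + 3 continuations */
        *cp = ((uint32_t)(s[0] & 0x07) << 18)
            | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6)
            | (s[3] & 0x3F);
        return 4;
    }
    return 0;                                    /* malformed lead byte */
}
```

[Since each call consumes a variable number of bytes, finding the Nth character means running this decoder from the start of the buffer - the stream-like access Todd describes.]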
The only reasons are space and time. If a string only contains code
points in the range 0-255 there's no point in squandering 4 bytes per
code point (same goes for 0-65535). Further, if in some application
interchange is more important than random access it may make sense on
performance grounds to use UTF-8 directly. Again, Smalltalk's dynamic
typing makes it easy to have one's cake and eat it too.

> My $0.02c

_,,,^..^,,,_ (phone)

>
>> I agree. :-)
>>
>> Regarding UTF-16, I just want to be able to export to, and receive
>> from, Windows (and any other platforms using UTF-16 as their native
>> character representation).
>>
>> Windows will always be able to accept UTF-16. All Windows apps *might
>> well* export UTF-16. There may be other platforms which use UTF-16 as
>> their native format. I'd just like to be able to cope with those
>> situations. Nothing more.
>>
>> All this requires is a Utf16String class that has an asUtf8String
>> method (and any other required conversion methods).
>
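[Editorial aside: the conversion an asUtf8String method would have to perform can be sketched in C. This is a hypothetical illustration of the underlying transcoding, not any actual Smalltalk implementation - it assumes host-byte-order UTF-16 input and omits error handling for unpaired surrogates:]

```c
#include <stddef.h>
#include <stdint.h>

/* Convert n UTF-16 code units (host byte order) to UTF-8 bytes in out,
 * combining surrogate pairs into supplementary-plane code points.
 * Returns the number of UTF-8 bytes written. */
size_t utf16_to_utf8(const uint16_t *in, size_t n, uint8_t *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        uint32_t cp;
        if (in[i] >= 0xD800 && in[i] <= 0xDBFF && i + 1 < n) {
            /* High surrogate + low surrogate -> code point >= U+10000 */
            cp = 0x10000 + (((uint32_t)(in[i] - 0xD800) << 10)
                            | (in[i + 1] - 0xDC00));
            i += 2;
        } else {
            cp = in[i];
            i += 1;
        }
        if (cp < 0x80) {                          /* 1-byte sequence */
            out[o++] = (uint8_t)cp;
        } else if (cp < 0x800) {                  /* 2-byte sequence */
            out[o++] = (uint8_t)(0xC0 | (cp >> 6));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                /* 3-byte sequence */
            out[o++] = (uint8_t)(0xE0 | (cp >> 12));
            out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else {                                  /* 4-byte sequence */
            out[o++] = (uint8_t)(0xF0 | (cp >> 18));
            out[o++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

[Note the ASCII case: every code point below 128 costs one byte in UTF-8 but two in UTF-16, which is the 50% overhead Todd mentions.]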
