Todd,

As long as others are using it, it's useful to be able to send UTF16, and to successfully import it.
I like systems that play well with others. :-)

On 5 December 2015 at 16:35, Todd Blanchard <[email protected]> wrote:
> would suggest that the only worthwhile encoding is UTF8 - the rest are
> distractions except for being able to read and convert from other encodings
> to UTF8. UTF16 is a complete waste of time.
>
> Read http://utf8everywhere.org/
>
> I have extensive Unicode chops from around 1999 to 2004 and my experience
> leads me to strongly agree with the views on that site.
>
>
> Sent from the road
>
> On Dec 5, 2015, at 05:08, stepharo <[email protected]> wrote:
>
> Hi EuanM
>
> On 4/12/15 12:42, EuanM wrote:
>
> > I'm currently groping my way to seeing how feature-complete our
> > Unicode support is. I am doing this to establish what still needs to
> > be done to provide full Unicode support.
>
> this is great. Thanks for pushing this. I wrote and collected some roadmap
> (analyses on different topics)
> on the pharo github project feel free to add this one there.
>
> > This seems to me to be an area where it would be best to write it
> > once, and then have the same codebase incorporated into the Smalltalks
> > that most share a common ancestry.
> >
> > I am keen to get: equality-testing for strings; sortability for
> > strings which have ligatures and diacritic characters; and correct
> > round-tripping of data.
>
> Go!
> My suggestion is
> start small
> make steady progress
> write tests
> commit often :)
>
> Stef
>
> What is the French phoneBook ordering because this is the first time I hear
> about it.
>
> > Call to action:
> > ==========
> >
> > If you have comments on these proposals - such as "but we already have
> > that facility" or "the reason we do not have these facilities is
> > because they are dog-slow" - please let me know them.
> >
> > If you would like to help out, please let me know.
> >
> > If you have Unicode experience and expertise, and would like to be, or
> > would be willing to be, in the 'council of experts' for this project,
> > please let me know.
> >
> > If you have comments or ideas on anything mentioned in this email
> >
> > In the first instance, the initiative's website will be:
> > http://smalltalk.uk.to/unicode.html
> >
> > I have created a SqueakSource.com project called UnicodeSupport
> >
> > I want to avoid re-inventing any facilities which already exist.
> > Except where they prevent us reaching the goals of:
> > - sortable UTF8 strings
> > - sortable UTF16 strings
> > - equivalence testing of 2 UTF8 strings
> > - equivalence testing of 2 UTF16 strings
> > - round-tripping UTF8 strings through Smalltalk
> > - round-tripping UTF16 strings through Smalltalk.
> > As I understand it, we have limited Unicode support atm.
> >
> > Current state of play
> > ===============
> > ByteString gets converted to WideString when need is automagically detected.
> >
> > Is there anything else that currently exists?
> >
> > Definition of Terms
> > ==============
> > A quick definition of terms before I go any further:
> >
> > Standard terms from the Unicode standard
> > ===============================
> > a compatibility character : an additional encoding of a *normal*
> > character, for compatibility and round-trip conversion purposes. For
> > instance, a 1-byte encoding of a Latin character with a diacritic.
> >
> > Made-up terms
> > ============
> > a convenience codepoint : a single codepoint which represents an item
> > that is also encoded as a string of codepoints.
> >
> > (I tend to use the terms compatibility character and compatibility
> > codepoint interchangeably. The standard only refers to them as
> > compatibility characters. However, the standard is determined to
> > emphasise that characters are abstract and that codepoints are
> > concrete. So I think it is often more useful and productive to think
> > of compatibility or convenience codepoints).
> >
> > a composed character : a character made up of several codepoints
> >
> > Unicode encoding explained
> > =====================
> > A convenience codepoint can therefore be thought of as a code point
> > used for a character which also has a composed form.
> >
> > The way Unicode works is that sometimes you can encode a character in
> > one byte, sometimes not. Sometimes you can encode it in two bytes,
> > sometimes not.
> >
> > You can therefore have a long stream of ASCII which is single-byte
> > Unicode. If there is an occasional Cyrillic or Greek character in the
> > stream, it would be represented either by a compatibility character or
> > by a multi-byte combination.
> >
> > Using compatibility characters can prevent proper sorting and
> > equivalence testing.
> >
> > Using "pure" Unicode, i.e. "normal encodings", can cause compatibility
> > and round-tripping problems. Although avoiding them can *also* cause
> > compatibility issues and round-tripping problems.
> >
> > Currently my thinking is:
> >
> > a Utf8String class
> >   an Ordered collection, with 1 byte characters as the modal element,
> >   but short arrays of wider strings where necessary
> > a Utf16String class
> >   an Ordered collection, with 2 byte characters as the modal element,
> >   but short arrays of wider strings
> >   beginning with a 2-byte endianness indicator.
> >
> > Utf8Strings sometimes need to be sortable, and sometimes need to be
> > compatible.
> >
> > So my thinking is that Utf8String will contain convenience codepoints,
> > for round-tripping. And where there are multiple convenience
> > codepoints for a character, that it standardises on one.
> >
> > And that there is a Utf8SortableString which uses *only* normal characters.
> >
> > We then need methods to convert between the two.
> >
> > aUtf8String asUtf8SortableString
> >
> > and
> >
> > aUtf8SortableString asUtf8String
> >
> >
> > Sort orders are culture and context dependent - Sweden and Germany
> > have different sort orders for the same diacritic-ed characters. Some
> > countries have one order in general usage, and another for specific
> > usages, such as phone directories (e.g. UK and France)
> >
> > Similarly for Utf16 : Utf16String and Utf16SortableString and
> > conversion methods
> >
> > A list of sorted words would be a SortedCollection, and there could be
> > pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
> > seOrder, ukOrder, etc
> >
> > along the lines of
> > aListOfWords := SortedCollection sortBlock: deOrder
> >
> > If a word is either a Utf8SortableString, or a well-formed Utf8String,
> > then we can perform equivalence testing on them trivially.
> >
> > To make sure a Utf8String is well formed, we would need to have a way
> > of cleaning up any convenience codepoints which were valid, but which
> > were for a character which has multiple equally-valid alternative
> > convenience codepoints, and for which the string currently had the
> > "wrong" convenience codepoint. (i.e. for any character with valid
> > alternative convenience codepoints, we would choose one to be in the
> > well-formed Utf8String, and we would need a method for cleaning the
> > alternative convenience codepoints out of the string, and replacing
> > them with the chosen approved convenience codepoint.)
> >
> > aUtf8String cleanUtf8String
> >
> > With WideString, a lot of the issues disappear - except
> > round-tripping (although I'm sure I have seen something recently about
> > 4-byte strings that also have an additional bit. Which would make
> > some Unicode characters 5-bytes long.)
> >
> >
> > (I'm starting to zone out now - if I've overlooked anything - obvious,
> > subtle, or somewhere in between, please let me know)
> >
> > Cheers,
> > Euan
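
As a rough illustration of the equivalence-testing and byte-length points in the quoted proposal, here is a minimal sketch. It is written in Python rather than Smalltalk, because the Utf8String classes named in the proposal are hypothetical and do not exist yet. It assumes that the "convenience codepoint" versus "composed character" distinction roughly corresponds to what the Unicode standard calls precomposed characters and combining sequences, and it uses NFC normalization as the single standardised form. The helper names canonical_equal and utf8_length are made up for this example and come from nowhere in the thread.

```python
import unicodedata

# Two spellings of the same user-visible character "é".
precomposed = "\u00e9"   # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# A naive comparison sees two different code point sequences.
raw_equal = precomposed == decomposed


def canonical_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def utf8_length(ch: str) -> int:
    """Hypothetical helper: number of bytes used to encode ch in UTF-8."""
    return len(ch.encode("utf-8"))


# UTF-8 is variable width: these four samples need 1, 2, 3 and 4 bytes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(ch) for ch in samples]

# Python's generic "utf-16" codec writes a 2-byte byte order mark first,
# which is comparable to the "2-byte endianness indicator" in the proposal.
utf16 = "A".encode("utf-16")
bom = utf16[:2]

# Round-tripping through UTF-8 preserves the exact code point sequence,
# so the decomposed spelling stays decomposed unless it is normalized.
round_tripped = decomposed.encode("utf-8").decode("utf-8")

if __name__ == "__main__":
    print(raw_equal, canonical_equal(precomposed, decomposed))
    print(lengths)
    print(bom, round_tripped == decomposed)
```

Normalizing both sides before comparing is one possible reading of the proposed cleanUtf8String step. Locale-specific ordering such as deOrder or frPhoneBookOrder is a separate problem (collation) and is not covered by this sketch.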
