> On 05 Dec 2015, at 17:35, Todd Blanchard <[email protected]> wrote:
>
> I would suggest that the only worthwhile encoding is UTF-8 - the rest
> are distractions, except for being able to read and convert from other
> encodings to UTF-8. UTF-16 is a complete waste of time.
>
> Read http://utf8everywhere.org/
>
> I have extensive Unicode chops from around 1999 to 2004, and my
> experience leads me to strongly agree with the views on that site.
Well, I read the page/document/site as well. It was very interesting
indeed, thanks for sharing it.

In some sense it made me reconsider my aversion to in-image UTF-8
encoding; maybe it could have some value. Storage is more efficient in
absolute terms, some processing might also be more efficient, and I/O
conversions to/from UTF-8 become a no-op.

What I found nice is the suggestion that most structured parsing (XML,
JSON, CSV, STON, ...) could actually ignore the encoding for a large
part and just assume it is ASCII, which could be nice for performance.
Also, the fact that a lot of strings are (or should be) treated as
opaque makes a lot of sense.

What I did not like is that much of the argumentation is based on
issues in the Windows world; take all that away and the document
shrinks by half. I would have liked a few more fundamental CS
arguments. Canonicalisation and sorting issues are hardly discussed.
In one place, the fact that a lot of special characters can have
multiple representations is a big argument, yet it is not mentioned
how just treating things as byte sequences would solve this (it
doesn't, AFAIU). For example, how do you search for $e or $é when you
know that $é can be represented both as the single codepoint $é and as
$e + $´ ?
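To make that concrete, here is a quick sketch in Pharo (it assumes the
Zinc #utf8Encoded extension that ships with a stock image):

  | composed decomposed |
  composed := String with: (Character value: 16rE9).
      "precomposed é, the single codepoint U+00E9"
  decomposed := String with: $e with: (Character value: 16r0301).
      "$e followed by the combining acute accent, U+0301"
  composed = decomposed.     "false - comparison is codepoint by codepoint"
  composed utf8Encoded.      "#[195 169]"
  decomposed utf8Encoded.    "#[101 204 129]"

Both render as é, but neither codepoint comparison nor byte comparison
treats them as equal, so without normalisation both searching and
equivalence testing fail across the two representations.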
Sven

> Sent from the road
>
> On Dec 5, 2015, at 05:08, stepharo <[email protected]> wrote:
>
>> Hi EuanM
>>
>> On 4/12/15 12:42, EuanM wrote:
>>> I'm currently groping my way to seeing how feature-complete our
>>> Unicode support is. I am doing this to establish what still needs
>>> to be done to provide full Unicode support.
>>
>> This is great. Thanks for pushing this. I have written and collected
>> some roadmap material (analyses of different topics) on the Pharo
>> GitHub project; feel free to add this one there.
>>>
>>> This seems to me to be an area where it would be best to write it
>>> once, and then have the same codebase incorporated into the
>>> Smalltalks that most share a common ancestry.
>>>
>>> I am keen to get: equality testing for strings; sortability for
>>> strings which have ligatures and diacritic characters; and correct
>>> round-tripping of data.
>> Go!
>> My suggestion is:
>> start small,
>> make steady progress,
>> write tests,
>> commit often :)
>>
>> Stef
>>
>> What is the French phone book ordering? This is the first time I
>> have heard about it.
>>>
>>> Call to action:
>>> ===========
>>>
>>> If you have comments on these proposals - such as "but we already
>>> have that facility" or "the reason we do not have these facilities
>>> is because they are dog-slow" - please let me know.
>>>
>>> If you would like to help out, please let me know.
>>>
>>> If you have Unicode experience and expertise, and would like to be,
>>> or would be willing to be, on the 'council of experts' for this
>>> project, please let me know.
>>>
>>> If you have comments or ideas on anything else mentioned in this
>>> email, please let me know.
>>>
>>> In the first instance, the initiative's website will be:
>>> http://smalltalk.uk.to/unicode.html
>>>
>>> I have created a SqueakSource.com project called UnicodeSupport.
>>>
>>> I want to avoid re-inventing any facilities which already exist,
>>> except where they prevent us reaching the goals of:
>>> - sortable UTF-8 strings
>>> - sortable UTF-16 strings
>>> - equivalence testing of 2 UTF-8 strings
>>> - equivalence testing of 2 UTF-16 strings
>>> - round-tripping UTF-8 strings through Smalltalk
>>> - round-tripping UTF-16 strings through Smalltalk
>>>
>>> As I understand it, we have limited Unicode support at the moment.
>>>
>>> Current state of play
>>> ================
>>>
>>> ByteString gets converted to WideString automagically when the need
>>> is detected.
>>>
>>> Is there anything else that currently exists?
>>>
>>> Definition of terms
>>> ===============
>>>
>>> A quick definition of terms before I go any further:
>>>
>>> Standard terms from the Unicode standard
>>> ================================
>>>
>>> a compatibility character : an additional encoding of a *normal*
>>> character, for compatibility and round-trip conversion purposes.
>>> For instance, a 1-byte encoding of a Latin character with a
>>> diacritic.
>>>
>>> Made-up terms
>>> ===========
>>>
>>> a convenience codepoint : a single codepoint which represents an
>>> item that is also encoded as a string of codepoints.
>>>
>>> (I tend to use the terms compatibility character and compatibility
>>> codepoint interchangeably. The standard only refers to them as
>>> compatibility characters. However, the standard is determined to
>>> emphasise that characters are abstract and that codepoints are
>>> concrete, so I think it is often more useful and productive to
>>> think of compatibility or convenience codepoints.)
>>>
>>> a composed character : a character made up of several codepoints
>>>
>>> Unicode encoding explained
>>> =====================
>>>
>>> A convenience codepoint can therefore be thought of as a codepoint
>>> used for a character which also has a composed form.
>>>
>>> The way Unicode works is that sometimes you can encode a character
>>> in one byte, sometimes not. Sometimes you can encode it in two
>>> bytes, sometimes not.
>>>
>>> You can therefore have a long stream of ASCII which is single-byte
>>> Unicode. If there is an occasional Cyrillic or Greek character in
>>> the stream, it would be represented either by a compatibility
>>> character or by a multi-byte combination.
>>>
>>> Using compatibility characters can prevent proper sorting and
>>> equivalence testing.
>>>
>>> Using "pure" Unicode, i.e. "normal encodings", can cause
>>> compatibility and round-tripping problems, although avoiding them
>>> can *also* cause compatibility issues and round-tripping problems.
>>>
>>> Currently my thinking is:
>>>
>>> a Utf8String class:
>>> an ordered collection with 1-byte characters as the modal element,
>>> but short arrays of wider characters where necessary
>>>
>>> a Utf16String class:
>>> an ordered collection with 2-byte characters as the modal element,
>>> but short arrays of wider characters,
>>> beginning with a 2-byte endianness indicator.
>>>
>>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>>> compatible.
>>>
>>> So my thinking is that Utf8String will contain convenience
>>> codepoints, for round-tripping, and that where there are multiple
>>> convenience codepoints for a character, it standardises on one.
>>>
>>> And that there is a Utf8SortableString which uses *only* normal
>>> characters.
>>>
>>> We then need methods to convert between the two:
>>>
>>> aUtf8String asUtf8SortableString
>>>
>>> and
>>>
>>> aUtf8SortableString asUtf8String
>>>
>>> Sort orders are culture- and context-dependent - Sweden and Germany
>>> have different sort orders for the same diacritic-ed characters.
>>> Some countries have one order in general usage, and another for
>>> specific usages, such as phone directories (e.g. UK and France).
>>>
>>> Similarly for UTF-16: Utf16String and Utf16SortableString and
>>> conversion methods.
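As a concrete illustration of the variable-width encoding and the
automagic ByteString/WideString switch described above, here is what a
current Pharo image does (again assuming Zinc's #utf8Encoded and
#utf8Decoded extensions):

  'abc' class.                    "ByteString - one byte per character"
  'aé' class.                     "ByteString - é is U+00E9, still fits in one byte"
  'a€' class.                     "WideString - € is U+20AC, does not fit in one byte"
  'a€' utf8Encoded.               "#[97 226 130 172] - 1 byte for $a, 3 bytes for $€"
  #[97 226 130 172] utf8Decoded.  "'a€' - a lossless round trip"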
>>> A list of sorted words would be a SortedCollection, and there could
>>> be pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>>> seOrder, ukOrder, etc., along the lines of
>>>
>>> aListOfWords := SortedCollection sortBlock: deOrder
>>>
>>> If a word is either a Utf8SortableString or a well-formed
>>> Utf8String, then we can perform equivalence testing on them
>>> trivially.
>>>
>>> To make sure a Utf8String is well formed, we would need a way of
>>> cleaning up any convenience codepoints which were valid, but which
>>> were for a character which has multiple equally-valid alternative
>>> convenience codepoints, and for which the string currently had the
>>> "wrong" convenience codepoint. (I.e. for any character with valid
>>> alternative convenience codepoints, we would choose one to be in
>>> the well-formed Utf8String, and we would need a method for cleaning
>>> the alternative convenience codepoints out of the string and
>>> replacing them with the chosen, approved convenience codepoint.)
>>>
>>> aUtf8String cleanUtf8String
>>>
>>> With WideString, a lot of the issues disappear - except
>>> round-tripping (although I'm sure I have seen something recently
>>> about 4-byte strings that also have an additional bit, which would
>>> make some Unicode characters 5 bytes long).
>>>
>>> (I'm starting to zone out now - if I've overlooked anything -
>>> obvious, subtle, or somewhere in between - please let me know.)
>>>
>>> Cheers,
>>> Euan
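As a footnote to the sortBlock idea above, a minimal, hypothetical
deOrder could look like this in Pharo (deFold and its folding table are
made up for illustration; a real implementation needs proper collation
keys - for instance ß folds to 'ss' - plus per-locale tailoring):

  | map deFold deOrder words |
  map := Dictionary new.
  map at: $ä put: $a; at: $ö put: $o; at: $ü put: $u.
  "fold umlauts to their base letters, roughly DIN 5007-1 dictionary order"
  deFold := [ :string |
      string asLowercase collect: [ :each | map at: each ifAbsent: [ each ] ] ].
  deOrder := [ :a :b | (deFold value: a) <= (deFold value: b) ].
  words := SortedCollection sortBlock: deOrder.
  words addAll: #('Zebra' 'Öl' 'Ofen' 'Äpfel' 'Apfel').
  words asArray.
  "Apfel and Äpfel sort together, then Ofen, Öl, Zebra"

The point of the sketch is only that a sortBlock reduces to comparing
locale-folded keys; computing those keys properly is exactly where the
canonicalisation issues discussed earlier come back in.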
