Steph - I'll dig out the Fr phone book ordering from wherever it was I read about it!
I thought I had it to hand, but I haven't found it tonight. It can't be far away.

On 5 December 2015 at 13:08, stepharo <steph...@free.fr> wrote:
> Hi EuanM
>
> On 4/12/15 12:42, EuanM wrote:
>>
>> I'm currently groping my way to seeing how feature-complete our
>> Unicode support is. I am doing this to establish what still needs to
>> be done to provide full Unicode support.
>
>
> This is great. Thanks for pushing this. I wrote and collected some roadmaps
> (analyses on different topics)
> on the Pharo GitHub project; feel free to add this one there.
>>
>>
>> This seems to me to be an area where it would be best to write it
>> once, and then have the same codebase incorporated into the Smalltalks
>> that most share a common ancestry.
>>
>> I am keen to get: equality-testing for strings; sortability for
>> strings which have ligatures and diacritic characters; and correct
>> round-tripping of data.
>
> Go!
> My suggestion is:
> start small
> make steady progress
> write tests
> commit often :)
>
> Stef
>
> What is the French phone book ordering? This is the first time I have
> heard of it.
>
>>
>> Call to action:
>> ==========
>>
>> If you have comments on these proposals - such as "but we already have
>> that facility" or "the reason we do not have these facilities is
>> because they are dog-slow" - please let me know them.
>>
>> If you would like to help out, please let me know.
>>
>> If you have Unicode experience and expertise, and would like to be, or
>> would be willing to be, in the 'council of experts' for this project,
>> please let me know.
>>
>> If you have comments or ideas on anything mentioned in this email
>>
>> In the first instance, the initiative's website will be:
>> http://smalltalk.uk.to/unicode.html
>>
>> I have created a SqueakSource.com project called UnicodeSupport
>>
>> I want to avoid re-inventing any facilities which already exist.
>> Except where they prevent us reaching the goals of:
>> - sortable UTF8 strings
>> - sortable UTF16 strings
>> - equivalence testing of 2 UTF8 strings
>> - equivalence testing of 2 UTF16 strings
>> - round-tripping UTF8 strings through Smalltalk
>> - round-tripping UTF16 strings through Smalltalk.
>> As I understand it, we have limited Unicode support at the moment.
>>
>> Current state of play
>> ===============
>> ByteString gets converted to WideString when need is automagically
>> detected.
>>
>> Is there anything else that currently exists?
>>
>> Definition of Terms
>> ==============
>> A quick definition of terms before I go any further:
>>
>> Standard terms from the Unicode standard
>> ===============================
>> a compatibility character : an additional encoding of a *normal*
>> character, for compatibility and round-trip conversion purposes. For
>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>
>> Made-up terms
>> ============
>> a convenience codepoint : a single codepoint which represents an item
>> that is also encoded as a string of codepoints.
>>
>> (I tend to use the terms compatibility character and compatibility
>> codepoint interchangeably. The standard only refers to them as
>> compatibility characters. However, the standard is determined to
>> emphasise that characters are abstract and that codepoints are
>> concrete. So I think it is often more useful and productive to think
>> of compatibility or convenience codepoints).
>>
>> a composed character : a character made up of several codepoints
>>
>> Unicode encoding explained
>> =====================
>> A convenience codepoint can therefore be thought of as a code point
>> used for a character which also has a composed form.
>>
>> The way Unicode works is that sometimes you can encode a character in
>> one byte, sometimes not. Sometimes you can encode it in two bytes,
>> sometimes not.
>>
>> You can therefore have a long stream of ASCII which is single-byte
>> Unicode.
>> If there is an occasional Cyrillic or Greek character in the
>> stream, it would be represented either by a compatibility character or
>> by a multi-byte combination.
>>
>> Using compatibility characters can prevent proper sorting and
>> equivalence testing.
>>
>> Using "pure" Unicode, i.e. "normal encodings", can cause compatibility
>> and round-tripping problems. Although avoiding them can *also* cause
>> compatibility issues and round-tripping problems.
>>
>> Currently my thinking is:
>>
>> a Utf8String class
>>     an Ordered collection, with 1 byte characters as the modal element,
>>     but short arrays of wider strings where necessary
>> a Utf16String class
>>     an Ordered collection, with 2 byte characters as the modal element,
>>     but short arrays of wider strings
>>     beginning with a 2-byte endianness indicator.
>>
>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>> compatible.
>>
>> So my thinking is that Utf8String will contain convenience codepoints,
>> for round-tripping. And where there are multiple convenience
>> codepoints for a character, that it standardises on one.
>>
>> And that there is a Utf8SortableString which uses *only* normal
>> characters.
>>
>> We then need methods to convert between the two.
>>
>> aUtf8String asUtf8SortableString
>>
>> and
>>
>> aUtf8SortableString asUtf8String
>>
>>
>> Sort orders are culture and context dependent - Sweden and Germany
>> have different sort orders for the same diacritic-ed characters. Some
>> countries have one order in general usage, and another for specific
>> usages, such as phone directories (e.g. UK and France).
>>
>> Similarly for Utf16 : Utf16String and Utf16SortableString and
>> conversion methods
>>
>> A list of sorted words would be a SortedCollection, and there could be
>> pre-prepared sortBlocks for them, e.g.
>> frPhoneBookOrder, deOrder, seOrder, ukOrder, etc.
>>
>> along the lines of
>> aListOfWords := SortedCollection sortBlock: deOrder
>>
>> If a word is either a Utf8SortableString, or a well-formed Utf8String,
>> then we can perform equivalence testing on them trivially.
>>
>> To make sure a Utf8String is well formed, we would need to have a way
>> of cleaning up any convenience codepoints which were valid, but which
>> were for a character which has multiple equally-valid alternative
>> convenience codepoints, and for which the string currently had the
>> "wrong" convenience codepoint. (i.e. for any character with valid
>> alternative convenience codepoints, we would choose one to be in the
>> well-formed Utf8String, and we would need a method for cleaning the
>> alternative convenience codepoints out of the string, and replacing
>> them with the chosen approved convenience codepoint.)
>>
>> aUtf8String cleanUtf8String
>>
>> With WideString, a lot of the issues disappear - except
>> round-tripping (although I'm sure I have seen something recently about
>> 4-byte strings that also have an additional bit. Which would make
>> some Unicode characters 5-bytes long.)
>>
>>
>> (I'm starting to zone out now - if I've overlooked anything - obvious,
>> subtle, or somewhere in between, please let me know)
>>
>> Cheers,
>> Euan
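
The equivalence-testing and sorting goals in the quoted proposal can be sketched with Unicode normalization. This is an illustration only, written in Python because its standard library ships `unicodedata.normalize`; it is not the proposed Smalltalk API. The names `canonically_equal` and `naive_sort_key` are hypothetical, and stripping combining marks is only a crude stand-in for a real locale-specific sortBlock such as deOrder or frPhoneBookOrder. The sketch treats a precomposed codepoint (a "convenience codepoint" in the terms used above) and its base-plus-combining-mark sequence as the same text.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def naive_sort_key(s: str):
    """Crude, locale-independent key: base letters first, accents as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", s)
    # Drop combining marks so an accented letter sorts next to its base letter.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), decomposed)


precomposed = "\u00e9cole"  # e-acute as a single codepoint
combining = "e\u0301cole"   # "e" followed by COMBINING ACUTE ACCENT

# The raw codepoint sequences differ, but they are canonically equivalent.
raw_equal = precomposed == combining
normalized_equal = canonically_equal(precomposed, combining)

# Hypothetical word list, sorted with the naive key.
words = sorted(["zebra", precomposed, "eagle", "ecole"], key=naive_sort_key)
```

Round-tripping is a separate concern: normalizing changes the codepoints, so a string that must be handed back unchanged would need to keep its original form alongside the normalized one, which is roughly the split between Utf8String and Utf8SortableString suggested above.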