And indeed, in principle.

On 7 December 2015 at 10:51, EuanM <[email protected]> wrote:
> Verifying assumptions is the key reason why you should put documents like
> this out for review.
>
> Sven -
>
> Cuis is encoded with ISO 8859-15  (aka ISO Latin 9)
>
> Sven, this is *NOT* as you state, ISO 99591, (and not as I stated, 8859-1).
>
> We caught the right specification bug for the wrong reason.
>
> Juan: "Cuis: Chose not to use Squeak approach. Chose to make the base
> image include and use only 1-byte strings. Chose to use ISO-8859-15"
>
> I have double-checked - each character encoded in ISO Latin 9 (ISO
> 8859-15) is exactly the character represented by the corresponding
> 1-byte codepoint in Unicode 0000 to 00FF,
>
> with the following exceptions:
>
> codepoint 20ac - Euro symbol
> character code a4 (replaces codepoint 00a4 generic currency symbol)
>
> codepoint 0160 Latin Upper Case S with Caron
> character code a6  (replaces codepoint 00A6, the broken bar ¦)
>
> codepoint 0161 Latin Lower Case s with Caron
> character code a8 (replaces codepoint 00A8, the diaeresis)
>
> codepoint 017d Latin Upper Case Z with Caron
> character code b4 (replaces codepoint 00b4 was Acute accent)
>
> codepoint 017e Latin Lower Case Z with Caron
> character code b8 (replaces codepoint 00b8 was cedilla)
>
> codepoint 0152 Upper Case OE ligature = Ethel
> character code bc (replaces codepoint 00bc was 1/4 symbol)
>
> codepoint 0153 Lower Case oe ligature = ethel
> character code bd (replaces codepoint 00bd was 1/2 symbol)
>
> codepoint 0178 Upper Case Y diaeresis
> character code be (replaces codepoint 00be was 3/4 symbol)
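For reference, the eight positions above are the only byte values where ISO 8859-15 and ISO 8859-1 disagree. That can be checked mechanically (Python is used here only because its built-in codecs make the byte values easy to verify; it is not part of the Smalltalk proposal):

```python
# The eight byte positions where ISO 8859-15 (Latin-9) differs from
# ISO 8859-1 (Latin-1), per the list above.
DIFFERING = [0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE]

for b in DIFFERING:
    latin1 = bytes([b]).decode("iso8859-1")
    latin9 = bytes([b]).decode("iso8859-15")
    print(f"{b:02X}: {latin1!r} (Latin-1) vs {latin9!r} (Latin-9)")

# Every other byte value decodes identically in the two charsets.
assert all(
    bytes([b]).decode("iso8859-1") == bytes([b]).decode("iso8859-15")
    for b in range(256)
    if b not in DIFFERING
)
```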
>
> Juan - I don't suppose we could persuade you to change to ISO Latin-1
> from ISO Latin-9?
>
> It means we could run the same 1 byte string encoding across  Cuis,
> Squeak, Pharo, and, as far as I can make out so far, Dolphin Smalltalk
> and Gnu Smalltalk.
>
> The downside would be that French users would lose the ability to use
> Y diaeresis, along with users of oe, OE, and of s, S, z, Z with
> caron.  Along with the Euro.
>
> https://en.wikipedia.org/wiki/ISO/IEC_8859-15.
>
> I'm confident I understand the use of UTF-8 in principle.
>
>
> On 7 December 2015 at 08:27, Sven Van Caekenberghe <[email protected]> wrote:
>> I am sorry but one of your basic assumptions is completely wrong:
>>
>> 'Les élèves Français' encodeWith: #iso99591.
>>
>> => #[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]
>>
>> 'Les élèves Français' utf8Encoded.
>>
>> => #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 
>> 97 105 115]
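Sven's two byte arrays above can be reproduced mechanically (Python shown only as a neutral way to check the bytes; the `#iso99591` symbol in the snippet is presumably a typo for ISO 8859-1):

```python
s = "Les élèves Français"

latin1 = list(s.encode("iso-8859-1"))
utf8 = list(s.encode("utf-8"))

print(latin1)  # one byte per character; é is 233, è is 232, ç is 231
print(utf8)    # é (233) becomes the pair 195 169, ç (231) becomes 195 167

# One byte per character in Latin-1; the three accented letters
# each take two bytes in UTF-8.
assert len(latin1) == 19 and len(utf8) == 22
```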
>>
>> ISO-8859-1 (~Latin1) is NOT AT ALL identical to UTF-8 in its upper, non-ASCII
>> part!
>>
>> Or shorter, $é is encoded in ISO-8859-1 as #[233], but as #[195 169] in
>> UTF-8.
>>
>> So more than half the points you make, or the facts that you state, are thus 
>> plain wrong.
>>
>> The only thing that is correct is that the code points are equal, but that 
>> is not the same as the encoding !
>>
>> From this I am inclined to conclude that you do not fundamentally understand 
>> how UTF-8 works, which does not strike me as good basis to design something 
>> called a UTF8String.
>>
>> Sorry.
>>
>> PS: Note also that Cuis' choice to use ISO-8859-1 only is pretty limiting in
>> a Unicode world.
>>
>>> On 07 Dec 2015, at 04:21, EuanM <[email protected]> wrote:
>>>
>>> This is a long email.  A *lot* of it is encapsulated in the Venn diagram at both:
>>> http://smalltalk.uk.to/unicode-utf8.html
>>> and my Smalltalk in Small Steps blog at:
>>> http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html
>>>
>>> My current thinking, and understanding.
>>> ==============================
>>>
>>> 0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
>>>   b) UTF-8 can encode all of those characters in 1 byte, but can
>>> prefer some of them to be encoded as sequences of multiple bytes.  And
>>> can encode additional characters as sequences of multiple bytes.
>>>
>>> 1) Smalltalk has long had multiple String classes.
>>>
>>> 2) Any UTF16 Unicode codepoint which has a codepoint of 00nn hex
>>>   is encoded as a UTF-8 codepoint of nn hex.
>>>
>>> 3) All valid ISO-8859-1 characters have a character code between 20
>>> hex and 7E hex, or between A0 hex and FF hex.
>>> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>>>
>>> 4) All valid ASCII characters have a character code between 00 hex and 7F 
>>> hex.
>>> https://en.wikipedia.org/wiki/ASCII
>>>
>>>
>>> 5) a) All character codes which are defined within both ISO-8859-1 and
>>> ASCII (i.e. character codes 20 hex to 7E hex) are
>>> defined identically in both.
>>>
>>> b) All printable ASCII characters are defined identically in both
>>> ASCII and ISO-8859-1
>>>
>>> 6) All character codes defined in ASCII (00 hex to 7F hex) are
>>> defined identically in Unicode UTF-8.
>>>
>>> 7) All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex
>>> - FF hex ) are defined identically in UTF-8.
>>>
>>> 8) => some Unicode codepoints map to both ASCII and ISO-8859-1.
>>>        all ASCII maps 1:1 to Unicode UTF-8
>>>        all ISO-8859-1 maps 1:1 to Unicode UTF-8
>>>
>>> 9) All ByteString elements which are either a valid ISO-8859-1
>>> character  or a valid ASCII character are *also* a valid UTF-8
>>> character.
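A caution on points 7–9, in line with Sven's reply above: it is the *codepoints* that coincide; the UTF-8 *encoding* of character codes A0 hex to FF hex is two bytes, not one. An illustration (Python, purely to make the bytes checkable):

```python
pound = "£"  # U+00A3, valid in ISO 8859-1, not in ASCII

assert ord(pound) == 0xA3                     # the codepoint equals the Latin-1 byte...
assert pound.encode("iso-8859-1") == b"\xa3"  # ...and Latin-1 encodes it as that one byte,
assert pound.encode("utf-8") == b"\xc2\xa3"   # ...but UTF-8 needs two bytes above 7F hex.

bel = "\x07"  # ASCII BEL
assert bel.encode("utf-8") == b"\x07"         # below 80 hex, UTF-8 really is identical
```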
>>>
>>> 10) ISO-8859-1 characters representing a character with a diacritic,
>>> or a two-character ligature, have no ASCII equivalent.  In Unicode
>>> UTF-8, those character codes which are representing compound glyphs,
>>> are called "compatibility codepoints".
>>>
>>> 11) The preferred Unicode representation of a character which has a
>>> compatibility codepoint is a short sequence of codepoints representing
>>> the component characters which are combined together to form the
>>> glyph of the convenience codepoint.
>>>
>>>
>>> 12) Some concrete examples:
>>>
>>> A - aka Upper Case A
>>> In ASCII, in ISO 8859-1
>>> ASCII A - 41 hex
>>> ISO-8859-1 A - 41 hex
>>> UTF-8 A - 41 hex
>>>
>>> BEL (a bell sound, often invoked by a Ctrl-g keyboard chord)
>>> In ASCII, not in ISO 8859-1
>>> ASCII : BEL  - 07 hex
>>> ISO-8859-1 : 07 hex is not a valid character code
>>> UTF-8 : BEL - 07 hex
>>>
>>> £ (GBP currency symbol)
>>> In ISO-8859-1, not in ASCII
>>> ASCII : A3 hex is not a valid ASCII code
>>> UTF-8: £ - A3 hex
>>> ISO-8859-1: £ - A3 hex
>>>
>>> Upper Case C cedilla
>>> In ISO-8859-1, not in ASCII, in UTF-8 as a compatibility codepoint
>>> *and* a composed set of codepoints
>>> ASCII : C7 hex is not a valid ASCII character code
>>> ISO-8859-1 : Upper Case C cedilla - C7 hex
>>> UTF-8 : Upper Case C cedilla (compatibility codepoint) - C7 hex
>>> Unicode preferred Upper Case C cedilla  (composed set of codepoints)
>>>  Upper case C 0043 hex (Upper case C)
>>>      followed by
>>>  combining cedilla 0327 hex (combining cedilla)
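A note on the composed form: the combining mark is U+0327 COMBINING CEDILLA (00B8 hex is the standalone spacing cedilla). The decomposition can be checked directly (Python's `unicodedata` used only for illustration):

```python
import unicodedata

precomposed = "\u00C7"  # Ç as a single codepoint
decomposed = unicodedata.normalize("NFD", precomposed)

print([hex(ord(c)) for c in decomposed])  # ['0x43', '0x327']
assert decomposed == "C\u0327"            # C + COMBINING CEDILLA
assert unicodedata.normalize("NFC", decomposed) == precomposed  # round-trips
```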
>>>
>>> 13) For any valid ASCII string *and* for any valid ISO-8859-1 string,
>>> aByteString is completely adequate for editing and display.
>>>
>>> 14) When sorting any valid ASCII string *or* any valid ISO-8859-1
>>> string, upper and lower case versions of the same character will be
>>> treated differently.
>>>
>>> 15) When sorting any valid ISO-8859-1 string containing
>>> letter+diacritic combination glyphs or ligature combination glyphs,
>>> the glyphs in combination will be treated differently to a "plain" glyph
>>> of the character,
>>> i.e. "C" and "C cedilla" will be treated very differently, and "ß" and
>>> "ss" will be treated very differently.
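The effect is visible with plain codepoint-order sorting: a byte/codepoint comparison puts every accented letter after the whole ASCII alphabet (Python here just to show the effect):

```python
words = ["Çanakkale", "Canet", "Dover"]

# Plain codepoint-order sorting: Ç is U+00C7, which is greater than
# every ASCII letter, so "Çanakkale" lands after "Dover".
print(sorted(words))  # ['Canet', 'Dover', 'Çanakkale']

assert sorted(words) == ["Canet", "Dover", "Çanakkale"]
```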
>>>
>>> 16) Different nations have different rules about where characters with
>>> diacritics and ligature pairs should be placed when in alphabetical
>>> order.
>>>
>>> 17) Some nations even have multiple standards - e.g.  surnames
>>> beginning either "M superscript-c" or "M superscript-a superscript-c"
>>> are treated as beginning equivalently in UK phone directories, but not
>>> in other situations.
>>>
>>>
>>> Some practical upshots
>>> ==================
>>>
>>> 1) Cuis and its ISO-8859-1 encoding is *exactly* the same as UTF-8,
>>> for any single character it considers valid, or any ByteString it has
>>> made up of characters it considers valid.
>>>
>>> 2) Any ByteString is valid UTF-8 in any of Squeak, Pharo, Cuis or any
>>> other Smalltalk with a single byte ByteString following ASCII or
>>> ISO-8859-1.
>>>
>>> 3) Any Smalltalk (or derivative language) using ByteString can
>>> immediately consider its ByteString as valid UTF-8, as long as it
>>> also considers the ByteString as valid ASCII and/or ISO-8859-1.
>>>
>>> 4) All of those can be successfully exported to any system using UTF-8
>>> (e.g. HTML).
>>>
>>> 5) To successfully *accept* all UTF-8 we must be able to do either:
>>> a) accept UTF-8 strings with composed characters
>>> b) convert UTF-8 strings with composed characters into UTF-8 strings
>>> that use *only* compatibility codepoints.
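Option (b) is essentially what the Unicode standard calls NFC normalization: replace combining sequences with their precomposed (single-codepoint) equivalents where one exists. A sketch of the conversion (Python's `unicodedata`, for illustration only):

```python
import unicodedata

composed = "C\u0327a va"  # "Ça va" spelled with C + COMBINING CEDILLA
precomposed = unicodedata.normalize("NFC", composed)

assert precomposed == "\u00C7a va"             # now one codepoint for Ç
assert len(composed) == 6 and len(precomposed) == 5
# The UTF-8 bytes differ too, so this is a real re-encoding step:
assert composed.encode("utf-8") != precomposed.encode("utf-8")
```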
>>>
>>>
>>> Class + protocol proposals
>>>
>>>
>>>
>>> a Utf8CompatibilityString class.
>>>
>>>  asByteString  - ensure only compatibility codepoints are used.
>>> Ensure it does not encode characters above 00FF hex.
>>>
>>>  asIso8859String - ensures only compatibility codepoints are used,
>>> and that the characters are each valid ISO 8859-1
>>>
>>>  asAsciiString - ensures only characters 00 hex - 7F hex are used.
>>>
>>>  asUtf8ComposedIso8859String - ensures all compatibility codepoints
>>> are expanded into small OrderedCollections of codepoints
>>>
>>> a Utf8ComposedIso8859String class - will provide sortable and
>>> comparable UTF8 strings of all ASCII and ISO 8859-1 strings.
>>>
>>> Then a Utf8SortableCollection class - a collection of
>>> Utf8ComposedIso8859Strings words and phrases.
>>>
>>> Custom sortBlocks will define the applicable sort order.
>>>
>>> We could create a collection (a Dictionary, thinking about it) of
>>> named, prefabricated sortBlocks.
>>>
>>> This will work for all UTF8 strings of ISO-8859-1 and ASCII strings.
>>>
>>> If anyone has better names for the classes, please let me know.
>>>
>>> If anyone else wants to help
>>>   - build these,
>>>   - create SUnit tests for these
>>>   - write documentation for these
>>> Please let me know.
>>>
>>> n.b. I have had absolutely no experience of Ropes.
>>>
>>> My own background with this stuff:  In the early 90's as a Project
>>> Manager implementing office automation systems across a global
>>> company, with offices in the Americas, Western, Eastern and Central
>>> Europe, (including Slavic and Cyrillic users) nations, Japan and
>>> China. The mission-critical application was word-processing.
>>>
>>> Our offices were spread around the globe, and we needed those offices
>>> to successfully exchange documents with their sister offices, and with
>>> the customers in each region the offices were in.
>>>
>>> Unicode was then new, and our platform supplier was the NeXT
>>> Corporation, who had been founder members of the Unicode Consortium
>>> in 1990.
>>>
>>> So far: I've read the latest version of the Unicode Standard (v8.0).
>>> This is freely downloadable.
>>> I've purchased a paper copy of an earlier release.  New releases
>>> typically consist of additional codespaces (i.e. alphabets).  So old
>>> copies are useful, as well as cheap.  (Paper copies of version 4.0
>>> are available second-hand for < $10 / €10).
>>>
>>> The typical change with each release is the addition of further
>>> codespaces (i.e. alphabets, more or less), so you don't lose a lot.
>>> (I'll be going through my V4.0 just to make sure.)
>>>
>>> Cheers,
>>>  Euan
>>>
>>>
>>>
>>>
>>> On 5 December 2015 at 13:08, stepharo <[email protected]> wrote:
>>>> Hi EuanM
>>>>
>>>> Le 4/12/15 12:42, EuanM a écrit :
>>>>>
>>>>> I'm currently groping my way to seeing how feature-complete our
>>>>> Unicode support is.  I am doing this to establish what still needs to
>>>>> be done to provide full Unicode support.
>>>>
>>>>
>>>> this is great. Thanks for pushing this. I wrote and collected some roadmap
>>>> (analyses on different topics)
>>>> on the pharo github project feel free to add this one there.
>>>>>
>>>>>
>>>>> This seems to me to be an area where it would be best to write it
>>>>> once, and then have the same codebase incorporated into the Smalltalks
>>>>> that most share a common ancestry.
>>>>>
>>>>> I am keen to get: equality-testing for strings; sortability for
>>>>> strings which have ligatures and diacritic characters; and correct
>>>>> round-tripping of data.
>>>>
>>>> Go!
>>>> My suggestion is
>>>>   start small
>>>>   make steady progress
>>>>   write tests
>>>>   commit often :)
>>>>
>>>> Stef
>>>>
>>>> What is the French phonebook ordering? This is the first time I have heard
>>>> about it.
>>>>
>>>>>
>>>>> Call to action:
>>>>> ==========
>>>>>
>>>>> If you have comments on these proposals - such as "but we already have
>>>>> that facility" or "the reason we do not have these facilities is
>>>>> because they are dog-slow" - please let me know them.
>>>>>
>>>>> If you would like to help out, please let me know.
>>>>>
>>>>> If you have Unicode experience and expertise, and would like to be, or
>>>>> would be willing to be, in the  'council of experts' for this project,
>>>>> please let me know.
>>>>>
>>>>> If you have comments or ideas on anything mentioned in this email
>>>>>
>>>>> In the first instance, the initiative's website will be:
>>>>> http://smalltalk.uk.to/unicode.html
>>>>>
>>>>> I have created a SqueakSource.com project called UnicodeSupport
>>>>>
>>>>> I want to avoid re-inventing any facilities which already exist.
>>>>> Except where they prevent us reaching the goals of:
>>>>>  - sortable UTF8 strings
>>>>>  - sortable UTF16 strings
>>>>>  - equivalence testing of 2 UTF8 strings
>>>>>  - equivalence testing of 2 UTF16 strings
>>>>>  - round-tripping UTF8 strings through Smalltalk
>>>>>  - roundtripping UTF16 strings through Smalltalk.
>>>>> As I understand it, we have limited Unicode support atm.
>>>>>
>>>>> Current state of play
>>>>> ===============
>>>>> ByteString gets automagically converted to WideString when the need is
>>>>> detected.
>>>>>
>>>>> Is there anything else that currently exists?
>>>>>
>>>>> Definition of Terms
>>>>> ==============
>>>>> A quick definition of terms before I go any further:
>>>>>
>>>>> Standard terms from the Unicode standard
>>>>> ===============================
>>>>> a compatibility character : an additional encoding of a *normal*
>>>>> character, for compatibility and round-trip conversion purposes.  For
>>>>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>>>>
>>>>> Made-up terms
>>>>> ============
>>>>> a convenience codepoint :  a single codepoint which represents an item
>>>>> that is also encoded as a string of codepoints.
>>>>>
>>>>> (I tend to use the terms compatibility character and compatibility
>>>>> codepoint interchangeably.  The standard only refers to them as
>>>>> compatibility characters.  However, the standard is determined to
>>>>> emphasise that characters are abstract and that codepoints are
>>>>> concrete.  So I think it is often more useful and productive to think
>>>>> of compatibility or convenience codepoints).
>>>>>
>>>>> a composed character :  a character made up of several codepoints
>>>>>
>>>>> Unicode encoding explained
>>>>> =====================
>>>>> A convenience codepoint can therefore be thought of as a code point
>>>>> used for a character which also has a composed form.
>>>>>
>>>>> The way Unicode works is that sometimes you can encode a character in
>>>>> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>>>>> sometimes not.
>>>>>
>>>>> You can therefore have a long stream of ASCII which is single-byte
>>>>> Unicode.  If there is an occasional Cyrillic or Greek character in the
>>>>> stream, it would be represented either by a compatibility character or
>>>>> by a multi-byte combination.
>>>>>
>>>>> Using compatibility characters can prevent proper sorting and
>>>>> equivalence testing.
>>>>>
>>>>> Using "pure" Unicode, ie. "normal encodings", can cause compatibility
>>>>> and round-tripping problems.  Although avoiding them can *also* cause
>>>>> compatibility issues and round-tripping problems.
>>>>>
>>>>> Currently my thinking is:
>>>>>
>>>>> a Utf8String class:
>>>>> an OrderedCollection with 1-byte characters as the modal element,
>>>>> but short arrays of wider characters where necessary.
>>>>>
>>>>> a Utf16String class:
>>>>> an OrderedCollection with 2-byte characters as the modal element,
>>>>> but short arrays of wider characters,
>>>>> beginning with a 2-byte endianness indicator.
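The 2-byte endianness indicator described for Utf16String is the standard UTF-16 byte order mark (BOM): FE FF for big-endian, FF FE for little-endian. How a BOM-aware decoder treats it (Python, illustrative only):

```python
# 'A' (U+0041) in UTF-16, preceded by each byte order mark.
little = b"\xff\xfeA\x00"  # BOM FF FE: little-endian payload
big = b"\xfe\xff\x00A"     # BOM FE FF: big-endian payload

# A BOM-aware decoder consumes the mark and picks the right byte order.
assert little.decode("utf-16") == "A"
assert big.decode("utf-16") == "A"

# The explicit-endian codecs neither expect nor emit a BOM.
assert "A".encode("utf-16-be") == b"\x00A"
assert "A".encode("utf-16-le") == b"A\x00"
```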
>>>>>
>>>>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>>>>> compatible.
>>>>>
>>>>> So my thinking is that Utf8String will contain convenience codepoints,
>>>>> for round-tripping.  And where there are multiple convenience
>>>>> codepoints for a character, that it standardises on one.
>>>>>
>>>>> And that there is a Utf8SortableString which uses *only* normal
>>>>> characters.
>>>>>
>>>>> We then need methods to convert between the two.
>>>>>
>>>>> aUtf8String asUtf8SortableString
>>>>>
>>>>> and
>>>>>
>>>>> aUtf8SortableString asUtf8String
>>>>>
>>>>>
>>>>> Sort orders are culture and context dependent - Sweden and Germany
>>>>> have different sort orders for the same diacritic-ed characters.  Some
>>>>> countries have one order in general usage, and another for specific
>>>>> usages, such as phone directories (e.g. UK and France)
>>>>>
>>>>> Similarly for Utf16 :  Utf16String and Utf16SortableString and
>>>>> conversion methods
>>>>>
>>>>> A list of sorted words would be a SortedCollection, and there could be
>>>>> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>>>>> seOrder, ukOrder, etc
>>>>>
>>>>> along the lines of
>>>>> aListOfWords := SortedCollection sortBlock: deOrder
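A dictionary of named sortBlocks might look like the following sketch (Python stands in for the Smalltalk here; the key names and the crude accent-stripping collation are my own illustration, not a proposed standard - a real frPhoneBookOrder or deOrder would need proper collation tailoring):

```python
import unicodedata

def accent_insensitive(word):
    # Crude collation key: decompose, drop combining marks, lowercase.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

# Named, prefabricated "sortBlocks" (here: key functions).
sort_blocks = {
    "codepointOrder": lambda w: w,
    "accentInsensitive": accent_insensitive,
}

words = ["côte", "Cote", "coteau"]
print(sorted(words, key=sort_blocks["accentInsensitive"]))  # ['côte', 'Cote', 'coteau']
```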
>>>>>
>>>>> If a word is either a Utf8SortableString, or a well-formed Utf8String,
>>>>> then we can perform equivalence testing on them trivially.
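Equivalence testing then reduces to comparing both strings in the same normalization form (Python's `unicodedata` again, purely illustrative):

```python
import unicodedata

a = "\u00C7a"   # "Ça" with precomposed Ç
b = "C\u0327a"  # "Ça" with C + COMBINING CEDILLA

assert a != b   # naive codepoint comparison says they differ...

def equivalent(s1, s2):
    # ...but they are canonically equivalent under a common form.
    return unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2)

assert equivalent(a, b)
```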
>>>>>
>>>>> To make sure a Utf8String is well formed, we would need to have a way
>>>>> of cleaning up any convenience codepoints which were valid, but which
>>>>> were for a character which has multiple equally-valid alternative
>>>>> convenience codepoints, and for which the string currently had the
>>>>> "wrong" convenience codepoint.  (i.e. for any character with valid
>>>>> alternative convenience codepoints, we would choose one to be in the
>>>>> well-formed Utf8String, and we would need a method for cleaning the
>>>>> alternative convenience codepoints out of the string, and replacing
>>>>> them with the chosen approved convenience codepoint.)
>>>>>
>>>>> aUtf8String cleanUtf8String
>>>>>
>>>>> With WideString, a lot of the issues disappear - except
>>>>> round-tripping (although I'm sure I have seen something recently about
>>>>> 4-byte strings that also have an additional bit.  Which would make
>>>>> some Unicode characters 5-bytes long.)
>>>>>
>>>>>
>>>>> (I'm starting to zone out now - if I've overlooked anything - obvious,
>>>>> subtle, or somewhere in between, please let me know)
>>>>>
>>>>> Cheers,
>>>>>    Euan
>>>>>
>>>>>
>>>>
>>>>
>>