Verifying assumptions is the key reason why you should send documents like this out for review.
Sven - Cuis is encoded with ISO 8859-15 (aka ISO Latin 9). Sven, this is *NOT*, as you state, ISO 99591 (and not, as I stated, 8859-1). We caught the right specification bug for the wrong reason.

Juan: "Cuis: Chose not to use Squeak approach. Chose to make the base image include and use only 1-byte strings. Chose to use ISO-8859-15"

I have double-checked - each character encoded in ISO Latin 9 (ISO 8859-15) is exactly the character represented by the corresponding 1-byte codepoint in Unicode 0000 to 00FF, with the following exceptions:

codepoint 20ac - Euro symbol - character code a4 (replaces codepoint 00a4, the generic currency symbol)
codepoint 0160 - Latin upper case S with caron - character code a6 (replaces codepoint 00a6, the | Unix pipe character)
codepoint 0161 - Latin lower case s with caron - character code a8 (replaces codepoint 00a8, diaeresis)
codepoint 017d - Latin upper case Z with caron - character code b4 (replaces codepoint 00b4, acute accent)
codepoint 017e - Latin lower case z with caron - character code b8 (replaces codepoint 00b8, cedilla)
codepoint 0152 - upper case OE ligature (ethel) - character code bc (replaces codepoint 00bc, the 1/4 symbol)
codepoint 0153 - lower case oe ligature (ethel) - character code bd (replaces codepoint 00bd, the 1/2 symbol)
codepoint 0178 - upper case Y diaeresis - character code be (replaces codepoint 00be, the 3/4 symbol)

Juan - I don't suppose we could persuade you to change to ISO Latin-1 from ISO Latin-9? It would mean we could run the same 1-byte string encoding across Cuis, Squeak, Pharo and, as far as I can make out so far, Dolphin Smalltalk and GNU Smalltalk. The downside would be that French users would lose the ability to use Y diaeresis, along with users of oe, OE, and s, S, z, Z with caron. Along with the Euro.
https://en.wikipedia.org/wiki/ISO/IEC_8859-15

I'm confident I understand the use of UTF-8 in principle.
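The eight exceptions listed above can be cross-checked mechanically. A short Python sketch (Python is used here purely as a neutral checking tool; its codec names for these two charsets are "latin-1" and "iso8859-15"):

```python
# Compare ISO 8859-1 (Latin-1) and ISO 8859-15 (Latin-9), byte by byte,
# collecting every byte value the two charsets decode differently.
diffs = {
    byte: (bytes([byte]).decode("latin-1"), bytes([byte]).decode("iso8859-15"))
    for byte in range(0x100)
    if bytes([byte]).decode("latin-1") != bytes([byte]).decode("iso8859-15")
}

# Exactly the eight byte values listed above differ.
assert sorted(diffs) == [0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE]
# e.g. byte a4 is the generic currency sign in Latin-1, the Euro in Latin-9,
# and byte be is the 3/4 symbol in Latin-1, Y diaeresis in Latin-9.
assert diffs[0xA4] == ("\u00a4", "\u20ac")
assert diffs[0xBE] == ("\u00be", "\u0178")
```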
On 7 December 2015 at 08:27, Sven Van Caekenberghe <[email protected]> wrote:
> I am sorry but one of your basic assumptions is completely wrong:
>
> 'Les élèves Français' encodeWith: #iso99591.
>
> => #[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]
>
> 'Les élèves Français' utf8Encoded.
>
> => #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167
> 97 105 115]
>
> ISO-9959-1 (~Latin1) is NOT AT ALL identical to UTF-8 in its upper, non-ASCII
> part !!
>
> Or shorter, $é is encoded in ISO-9959-1 as #[233], but as #[195 169] in UTF-8.
>
> So more than half the points you make, or the facts that you state, are thus
> plain wrong.
>
> The only thing that is correct is that the code points are equal, but that is
> not the same as the encoding !
>
> From this I am inclined to conclude that you do not fundamentally understand
> how UTF-8 works, which does not strike me as a good basis to design something
> called a UTF8String.
>
> Sorry.
>
> PS: Note also that Cuis' choice to use ISO-9959-1 only is pretty limiting in
> a Unicode world.
>
>> On 07 Dec 2015, at 04:21, EuanM <[email protected]> wrote:
>>
>> This is a long email. A *lot* of it is encapsulated in the Venn diagram, both at:
>> http://smalltalk.uk.to/unicode-utf8.html
>> and my Smalltalk in Small Steps blog at:
>> http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html
>>
>> My current thinking, and understanding.
>> ==============================
>>
>> 0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
>> b) UTF-8 can encode all of those characters in 1 byte, but can
>> prefer some of them to be encoded as sequences of multiple bytes. And
>> can encode additional characters as sequences of multiple bytes.
>>
>> 1) Smalltalk has long had multiple String classes.
>>
>> 2) Any UTF16 Unicode codepoint which has a codepoint of 00nn hex
>> is encoded as a UTF-8 codepoint of nn hex.
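Sven's byte arrays are easy to reproduce outside Smalltalk with Python's stdlib codecs, as a neutral cross-check of the point he is making:

```python
s = "Les élèves Français"

# Sven's two arrays, byte for byte:
assert list(s.encode("latin-1")) == [
    76, 101, 115, 32, 233, 108, 232, 118, 101, 115, 32,
    70, 114, 97, 110, 231, 97, 105, 115]
assert list(s.encode("utf-8")) == [
    76, 101, 115, 32, 195, 169, 108, 195, 168, 118, 101, 115, 32,
    70, 114, 97, 110, 195, 167, 97, 105, 115]

# Same code points, different encodings: é is U+00E9 either way, but it is
# one byte (E9) in Latin-1 and two bytes (C3 A9) in UTF-8. Only code points
# below 80 hex encode identically in both.
assert "é".encode("latin-1") == b"\xe9"
assert "é".encode("utf-8") == b"\xc3\xa9"
```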
>>
>> 3) All valid ISO-8859-1 characters have a character code between 20
>> hex and 7E hex, or between A0 hex and FF hex.
>> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>>
>> 4) All valid ASCII characters have a character code between 00 hex and 7E
>> hex.
>> https://en.wikipedia.org/wiki/ASCII
>>
>>
>> 5) a) All character codes which are defined within ISO-8859-1 and also
>> defined within ASCII (i.e. character codes 20 hex to 7E hex) are
>> defined identically in both.
>>
>> b) All printable ASCII characters are defined identically in both
>> ASCII and ISO-8859-1.
>>
>> 6) All character codes defined in ASCII (00 hex to 7E hex) are
>> defined identically in Unicode UTF-8.
>>
>> 7) All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex
>> - FF hex) are defined identically in UTF-8.
>>
>> 8) => some Unicode codepoints map to both ASCII and ISO-8859-1.
>> all ASCII maps 1:1 to Unicode UTF-8
>> all ISO-8859-1 maps 1:1 to Unicode UTF-8
>>
>> 9) All ByteString elements which are either a valid ISO-8859-1
>> character or a valid ASCII character are *also* a valid UTF-8
>> character.
>>
>> 10) ISO-8859-1 characters representing a character with a diacritic,
>> or a two-character ligature, have no ASCII equivalent. In Unicode
>> UTF-8, those character codes which represent compound glyphs
>> are called "compatibility codepoints".
>>
>> 11) The preferred Unicode representation of the characters which have
>> compatibility codepoints is as a short set of codepoints
>> representing the characters which are combined together to form the
>> glyph of the convenience codepoint, as a sequence of bytes
>> representing the component characters.
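Point 11 is the precomposed-vs-decomposed distinction that the Unicode standard formalises as NFC/NFD normalization. A small Python sketch using the stdlib `unicodedata` module illustrates it; one detail worth flagging is that the combining mark in the canonical decomposition is U+0327 COMBINING CEDILLA, not the spacing cedilla U+00B8:

```python
import unicodedata

precomposed = "\u00c7"  # Ç - LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = unicodedata.normalize("NFD", precomposed)

# Canonical decomposition: C (U+0043) followed by COMBINING CEDILLA (U+0327).
assert [ord(c) for c in decomposed] == [0x0043, 0x0327]

# NFC recomposes it, so the two forms round-trip.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```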
>>
>>
>> 12) Some concrete examples:
>>
>> A - aka Upper Case A
>> In ASCII, in ISO 8859-1
>> ASCII A - 41 hex
>> ISO-8859-1 A - 41 hex
>> UTF-8 A - 41 hex
>>
>> BEL (a bell sound, often invoked by a Ctrl-g keyboard chord)
>> In ASCII, not in ISO 8859-1
>> ASCII : BEL - 07 hex
>> ISO-8859-1 : 07 hex is not a valid character code
>> UTF-8 : BEL - 07 hex
>>
>> £ (GBP currency symbol)
>> In ISO-8859-1, not in ASCII
>> ASCII : A3 hex is not a valid ASCII code
>> UTF-8: £ - A3 hex
>> ISO-8859-1: £ - A3 hex
>>
>> Upper Case C cedilla
>> In ISO-8859-1, not in ASCII, in UTF-8 as a compatibility codepoint
>> *and* a composed set of codepoints
>> ASCII : C7 hex is not a valid ASCII character code
>> ISO-8859-1 : Upper Case C cedilla - C7 hex
>> UTF-8 : Upper Case C cedilla (compatibility codepoint) - C7 hex
>> Unicode preferred Upper Case C cedilla (composed set of codepoints)
>> Upper case C 0043 hex (Upper case C)
>> followed by
>> cedilla 00B8 hex (cedilla)
>>
>> 13) For any valid ASCII string *and* for any valid ISO-8859-1 string,
>> aByteString is completely adequate for editing and display.
>>
>> 14) When sorting any valid ASCII string *or* any valid ISO-8859-1
>> string, upper and lower case versions of the same character will be
>> treated differently.
>>
>> 15) When sorting any valid ISO-8859-1 string containing
>> letter+diacritic combination glyphs or ligature combination glyphs,
>> the glyphs in combination will be treated differently to a "plain" glyph
>> of the character,
>> i.e. "C" and "C cedilla" will be treated very differently. "ß" and
>> "fs" will be treated very differently.
>>
>> 16) Different nations have different rules about where diacritic-ed
>> characters and ligature pairs should be placed when in alphabetical
>> order.
>>
>> 17) Some nations even have multiple standards - e.g. surnames
>> beginning either "M superscript-c" or "M superscript-a superscript-c"
>> are treated as beginning equivalently in UK phone directories, but not
>> in other situations.
>>
>>
>> Some practical upshots
>> ==================
>>
>> 1) Cuis and its ISO-8859-1 encoding is *exactly* the same as UTF-8,
>> for any single character it considers valid, or any ByteString it has
>> made up of characters it considers valid.
>>
>> 2) Any ByteString is valid UTF-8 in any of Squeak, Pharo, Cuis or any
>> other Smalltalk with a single byte ByteString following ASCII or
>> ISO-8859-1.
>>
>> 3) Any Smalltalk (or derivative language) using ByteString can
>> immediately consider its ByteString as valid UTF-8, as long as it
>> also considers the ByteString as valid ASCII and/or ISO-8859-1.
>>
>> 4) All of those can be successfully exported to any system using UTF-8
>> (e.g. HTML).
>>
>> 5) To successfully *accept* all UTF-8 we must be able to do either:
>> a) accept UTF-8 strings with composed characters
>> b) convert UTF-8 strings with composed characters into UTF-8 strings
>> that use *only* compatibility codepoints.
>>
>>
>> Class + protocol proposals
>>
>>
>>
>> a Utf8CompatibilityString class.
>>
>> asByteString - ensure only compatibility codepoints are used.
>> Ensure it does not encode characters above 00FF hex.
>>
>> asIso8859String - ensures only compatibility codepoints are used,
>> and that the characters are each valid ISO 8859-1
>>
>> asAsciiString - ensures only characters 00 hex - 7F hex are used.
>>
>> asUtf8ComposedIso8859String - ensures all compatibility codepoints
>> are expanded into small OrderedCollections of codepoints
>>
>> a Utf8ComposedIso8859String class - will provide sortable and
>> comparable UTF8 strings of all ASCII and ISO 8859-1 strings.
>>
>> Then a Utf8SortableCollection class - a collection of
>> Utf8ComposedIso8859String words and phrases.
>>
>> Custom sortBlocks will define the applicable sort order.
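As a sketch of what one such named sortBlock might do, here is a deliberately crude accent-folding comparison key in Python (the function name and the approach are illustrative assumptions only; a real per-locale collation needs the Unicode Collation Algorithm or an ICU binding, and handles far more cases than this):

```python
import unicodedata

def accent_folded_key(s):
    # Decompose to NFD, drop combining marks, then casefold - so that
    # "é" sorts alongside "e" and case is ignored. Real collators also
    # handle ligatures, locale tailoring, multi-level weights, etc.
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c)).casefold()

words = ["zèbre", "Zoe", "école", "Echo"]
assert sorted(words, key=accent_folded_key) == ["Echo", "école", "zèbre", "Zoe"]
```

A naive `sorted(words)` would instead put all the capitalised words first and the accented ones last, by raw code point.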
>>
>> We can create a collection... a Dictionary, thinking about it, of
>> named, prefabricated sortBlocks.
>>
>> This will work for all UTF8 strings of ISO-8859-1 and ASCII strings.
>>
>> If anyone has better names for the classes, please let me know.
>>
>> If anyone else wants to help
>> - build these,
>> - create SUnit tests for these
>> - write documentation for these
>> please let me know.
>>
>> n.b. I have had absolutely no experience of Ropes.
>>
>> My own background with this stuff: in the early 90's I was a Project
>> Manager implementing office automation systems across a global
>> company, with offices in the Americas, Western, Eastern and Central
>> European nations (including Slavic and Cyrillic users), Japan and
>> China. The mission-critical application was word-processing.
>>
>> Our offices were spread around the globe, and we needed those offices
>> to successfully exchange documents with their sister offices, and with
>> the customers in each region the offices were in.
>>
>> Unicode was then new, and our platform supplier was the NeXT
>> Corporation, who had been a founder member of the Unicode Consortium
>> in 1990.
>>
>> So far: I've read the latest version of the Unicode Standard (v8.0).
>> This is freely downloadable.
>> I've purchased a paper copy of an earlier release. New releases
>> typically consist of additional codespaces (i.e. alphabets), so old
>> copies are useful, as well as cheap. (Paper copies of version 4.0
>> are available second-hand for < $10 / €10.)
>>
>> The typical change with each release is the addition of further
>> codespaces (i.e. alphabets, more or less), so you don't lose a lot.
>> (I'll be going through my v4.0 just to make sure.)
>>
>> Cheers,
>> Euan
>>
>>
>> On 5 December 2015 at 13:08, stepharo <[email protected]> wrote:
>>> Hi EuanM
>>>
>>> On 4/12/15 12:42, EuanM wrote:
>>>>
>>>> I'm currently groping my way to seeing how feature-complete our
>>>> Unicode support is. I am doing this to establish what still needs to
>>>> be done to provide full Unicode support.
>>>
>>>
>>> This is great. Thanks for pushing this. I wrote and collected some roadmap
>>> (analyses on different topics)
>>> on the Pharo GitHub project; feel free to add this one there.
>>>>
>>>>
>>>> This seems to me to be an area where it would be best to write it
>>>> once, and then have the same codebase incorporated into the Smalltalks
>>>> that most share a common ancestry.
>>>>
>>>> I am keen to get: equality-testing for strings; sortability for
>>>> strings which have ligatures and diacritic characters; and correct
>>>> round-tripping of data.
>>>
>>> Go!
>>> My suggestion is
>>> start small
>>> make steady progress
>>> write tests
>>> commit often :)
>>>
>>> Stef
>>>
>>> What is the French phone-book ordering? Because this is the first time I hear
>>> about it.
>>>
>>>>
>>>> Call to action:
>>>> ==========
>>>>
>>>> If you have comments on these proposals - such as "but we already have
>>>> that facility" or "the reason we do not have these facilities is
>>>> because they are dog-slow" - please let me know them.
>>>>
>>>> If you would like to help out, please let me know.
>>>>
>>>> If you have Unicode experience and expertise, and would like to be, or
>>>> would be willing to be, in the 'council of experts' for this project,
>>>> please let me know.
>>>>
>>>> If you have comments or ideas on anything mentioned in this email
>>>>
>>>> In the first instance, the initiative's website will be:
>>>> http://smalltalk.uk.to/unicode.html
>>>>
>>>> I have created a SqueakSource.com project called UnicodeSupport
>>>>
>>>> I want to avoid re-inventing any facilities which already exist.
>>>> Except where they prevent us reaching the goals of:
>>>> - sortable UTF8 strings
>>>> - sortable UTF16 strings
>>>> - equivalence testing of 2 UTF8 strings
>>>> - equivalence testing of 2 UTF16 strings
>>>> - round-tripping UTF8 strings through Smalltalk
>>>> - round-tripping UTF16 strings through Smalltalk.
>>>> As I understand it, we have limited Unicode support atm.
>>>>
>>>> Current state of play
>>>> ===============
>>>> ByteString gets converted to WideString when need is automagically
>>>> detected.
>>>>
>>>> Is there anything else that currently exists?
>>>>
>>>> Definition of Terms
>>>> ==============
>>>> A quick definition of terms before I go any further:
>>>>
>>>> Standard terms from the Unicode standard
>>>> ===============================
>>>> a compatibility character : an additional encoding of a *normal*
>>>> character, for compatibility and round-trip conversion purposes. For
>>>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>>>
>>>> Made-up terms
>>>> ============
>>>> a convenience codepoint : a single codepoint which represents an item
>>>> that is also encoded as a string of codepoints.
>>>>
>>>> (I tend to use the terms compatibility character and compatibility
>>>> codepoint interchangeably. The standard only refers to them as
>>>> compatibility characters. However, the standard is determined to
>>>> emphasise that characters are abstract and that codepoints are
>>>> concrete. So I think it is often more useful and productive to think
>>>> of compatibility or convenience codepoints.)
>>>>
>>>> a composed character : a character made up of several codepoints
>>>>
>>>> Unicode encoding explained
>>>> =====================
>>>> A convenience codepoint can therefore be thought of as a code point
>>>> used for a character which also has a composed form.
>>>>
>>>> The way Unicode works is that sometimes you can encode a character in
>>>> one byte, sometimes not. Sometimes you can encode it in two bytes,
>>>> sometimes not.
>>>>
>>>> You can therefore have a long stream of ASCII which is single-byte
>>>> Unicode. If there is an occasional Cyrillic or Greek character in the
>>>> stream, it would be represented either by a compatibility character or
>>>> by a multi-byte combination.
>>>>
>>>> Using compatibility characters can prevent proper sorting and
>>>> equivalence testing.
>>>>
>>>> Using "pure" Unicode, i.e. "normal encodings", can cause compatibility
>>>> and round-tripping problems. Although avoiding them can *also* cause
>>>> compatibility issues and round-tripping problems.
>>>>
>>>> Currently my thinking is:
>>>>
>>>> a Utf8String class
>>>> an OrderedCollection, with 1-byte characters as the modal element,
>>>> but short arrays of wider strings where necessary
>>>> a Utf16String class
>>>> an OrderedCollection, with 2-byte characters as the modal element,
>>>> but short arrays of wider strings,
>>>> beginning with a 2-byte endianness indicator.
>>>>
>>>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>>>> compatible.
>>>>
>>>> So my thinking is that Utf8String will contain convenience codepoints,
>>>> for round-tripping. And where there are multiple convenience
>>>> codepoints for a character, that it standardises on one.
>>>>
>>>> And that there is a Utf8SortableString which uses *only* normal
>>>> characters.
>>>>
>>>> We then need methods to convert between the two.
>>>>
>>>> aUtf8String asUtf8SortableString
>>>>
>>>> and
>>>>
>>>> aUtf8SortableString asUtf8String
>>>>
>>>>
>>>> Sort orders are culture and context dependent - Sweden and Germany
>>>> have different sort orders for the same diacritic-ed characters. Some
>>>> countries have one order in general usage, and another for specific
>>>> usages, such as phone directories (e.g.
>>>> UK and France)
>>>>
>>>> Similarly for Utf16 : Utf16String and Utf16SortableString and
>>>> conversion methods
>>>>
>>>> A list of sorted words would be a SortedCollection, and there could be
>>>> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>>>> seOrder, ukOrder, etc.
>>>>
>>>> along the lines of
>>>> aListOfWords := SortedCollection sortBlock: deOrder
>>>>
>>>> If a word is either a Utf8SortableString, or a well-formed Utf8String,
>>>> then we can perform equivalence testing on them trivially.
>>>>
>>>> To make sure a Utf8String is well formed, we would need to have a way
>>>> of cleaning up any convenience codepoints which were valid, but which
>>>> were for a character which has multiple equally-valid alternative
>>>> convenience codepoints, and for which the string currently had the
>>>> "wrong" convenience codepoint. (i.e. for any character with valid
>>>> alternative convenience codepoints, we would choose one to be in the
>>>> well-formed Utf8String, and we would need a method for cleaning the
>>>> alternative convenience codepoints out of the string, and replacing
>>>> them with the chosen approved convenience codepoint.)
>>>>
>>>> aUtf8String cleanUtf8String
>>>>
>>>> With WideString, a lot of the issues disappear - except
>>>> round-tripping (although I'm sure I have seen something recently about
>>>> 4-byte strings that also have an additional bit, which would make
>>>> some Unicode characters 5 bytes long).
>>>>
>>>>
>>>> (I'm starting to zone out now - if I've overlooked anything - obvious,
>>>> subtle, or somewhere in between, please let me know)
>>>>
>>>> Cheers,
>>>> Euan
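The equivalence-testing goal running through this thread boils down to comparing strings under a common normalization form, which is essentially what the proposed cleanUtf8String / asUtf8SortableString pair would do. A hedged Python sketch of the idea (the function name is made up for illustration, and NFC is chosen arbitrarily as the single "approved" form):

```python
import unicodedata

def utf8_equivalent(a: bytes, b: bytes) -> bool:
    # Decode both UTF-8 byte sequences, then compare under one agreed
    # normalization form (NFC here, by assumption).
    return (unicodedata.normalize("NFC", a.decode("utf-8"))
            == unicodedata.normalize("NFC", b.decode("utf-8")))

precomposed = "\u00e9".encode("utf-8")   # é as one codepoint: bytes C3 A9
combining = "e\u0301".encode("utf-8")    # e + combining acute: bytes 65 CC 81

assert precomposed != combining          # the raw byte sequences differ...
assert utf8_equivalent(precomposed, combining)  # ...but the strings are equivalent
```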
