Verifying assumptions is the key reason why you should send documents like this out for review.
Sven - Cuis is encoded with ISO 8859-15 (aka ISO Latin 9). Sven, this is *NOT*, as you state, ISO 99591 (and not, as I stated, 8859-1). We caught the right specification bug for the wrong reason.

Juan: "Cuis: Chose not to use Squeak approach. Chose to make the base image include and use only 1-byte strings. Chose to use ISO-8859-15"

I have double-checked - each character encoded in ISO Latin 9 (ISO 8859-15) is exactly the character represented by the corresponding 1-byte codepoint in Unicode 0000 to 00FF, with the following exceptions:

codepoint 20ac - Euro symbol - character code a4 (replaces codepoint 00a4, the generic currency symbol)
codepoint 0160 - Latin upper case S with caron - character code a6 (replaces codepoint 00a6, the | Unix pipe character)
codepoint 0161 - Latin lower case s with caron - character code a8 (replaces codepoint 00a8, diaeresis)
codepoint 017d - Latin upper case Z with caron - character code b4 (replaces codepoint 00b4, acute accent)
codepoint 017e - Latin lower case z with caron - character code b8 (replaces codepoint 00b8, cedilla)
codepoint 0152 - upper case OE ligature (ethel) - character code bc (replaces codepoint 00bc, the 1/4 symbol)
codepoint 0153 - lower case oe ligature (ethel) - character code bd (replaces codepoint 00bd, the 1/2 symbol)
codepoint 0178 - upper case Y diaeresis - character code be (replaces codepoint 00be, the 3/4 symbol)

Juan - I don't suppose we could persuade you to change to ISO Latin-1 from ISO Latin-9? It would mean we could run the same 1-byte string encoding across Cuis, Squeak, Pharo and, as far as I can make out so far, Dolphin Smalltalk and GNU Smalltalk. The downside would be that French users would lose the ability to use Y diaeresis, along with users of oe, OE, and s, S, z, Z with caron. Along with the Euro.
https://en.wikipedia.org/wiki/ISO/IEC_8859-15

I'm confident I understand the use of UTF-8 in principle.
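The eight exceptions listed above can be cross-checked mechanically. A short Python sketch (Python is used here purely as a neutral checking tool; its codec names for these two charsets are "latin-1" and "iso8859-15"):

```python
# Compare ISO 8859-1 (Latin-1) and ISO 8859-15 (Latin-9), byte by byte,
# collecting every byte value the two charsets decode differently.
diffs = {
    byte: (bytes([byte]).decode("latin-1"), bytes([byte]).decode("iso8859-15"))
    for byte in range(0x100)
    if bytes([byte]).decode("latin-1") != bytes([byte]).decode("iso8859-15")
}

# Exactly the eight byte values listed above differ.
assert sorted(diffs) == [0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE]
# e.g. byte a4 is the generic currency sign in Latin-1, the Euro in Latin-9,
# and byte be is the 3/4 symbol in Latin-1, Y diaeresis in Latin-9.
assert diffs[0xA4] == ("\u00a4", "\u20ac")
assert diffs[0xBE] == ("\u00be", "\u0178")
```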
On 7 December 2015 at 08:27, Sven Van Caekenberghe <[email protected]> wrote:
> I am sorry but one of your basic assumptions is completely wrong:
>
> 'Les élèves Français' encodeWith: #iso99591.
>
> => #[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]
>
> 'Les élèves Français' utf8Encoded.
>
> => #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167
> 97 105 115]
>
> ISO-9959-1 (~Latin1) is NOT AT ALL identical to UTF-8 in its upper, non-ASCII
> part !!
>
> Or shorter, $é is encoded in ISO-9959-1 as #[233], but as #[195 169] in UTF-8.
>
> So more than half the points you make, or the facts that you state, are thus
> plain wrong.
>
> The only thing that is correct is that the code points are equal, but that is
> not the same as the encoding !
>
> From this I am inclined to conclude that you do not fundamentally understand
> how UTF-8 works, which does not strike me as a good basis to design something
> called a UTF8String.
>
> Sorry.
>
> PS: Note also that Cuis' choice to use ISO-9959-1 only is pretty limiting in
> a Unicode world.
>
>> On 07 Dec 2015, at 04:21, EuanM <[email protected]> wrote:
>>
>> This is a long email. A *lot* of it is encapsulated in the Venn diagram, both at:
>> http://smalltalk.uk.to/unicode-utf8.html
>> and my Smalltalk in Small Steps blog at:
>> http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html
>>
>> My current thinking, and understanding.
>> ==============================
>>
>> 0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
>> b) UTF-8 can encode all of those characters in 1 byte, but can
>> prefer some of them to be encoded as sequences of multiple bytes. And
>> can encode additional characters as sequences of multiple bytes.
>>
>> 1) Smalltalk has long had multiple String classes.
>>
>> 2) Any UTF16 Unicode codepoint which has a codepoint of 00nn hex
>> is encoded as a UTF-8 codepoint of nn hex.
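Sven's byte arrays are easy to reproduce outside Smalltalk with Python's stdlib codecs, as a neutral cross-check of the point he is making:

```python
s = "Les élèves Français"

# Sven's two arrays, byte for byte:
assert list(s.encode("latin-1")) == [
    76, 101, 115, 32, 233, 108, 232, 118, 101, 115, 32,
    70, 114, 97, 110, 231, 97, 105, 115]
assert list(s.encode("utf-8")) == [
    76, 101, 115, 32, 195, 169, 108, 195, 168, 118, 101, 115, 32,
    70, 114, 97, 110, 195, 167, 97, 105, 115]

# Same code points, different encodings: é is U+00E9 either way, but it is
# one byte (E9) in Latin-1 and two bytes (C3 A9) in UTF-8. Only code points
# below 80 hex encode identically in both.
assert "é".encode("latin-1") == b"\xe9"
assert "é".encode("utf-8") == b"\xc3\xa9"
```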
>>
>> 3) All valid ISO-8859-1 characters have a character code between 20
>> hex and 7E hex, or between A0 hex and FF hex.
>> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>>
>> 4) All valid ASCII characters have a character code between 00 hex and 7E
>> hex.
>> https://en.wikipedia.org/wiki/ASCII
>>
>>
>> 5) a) All character codes which are defined within ISO-8859-1 and also
>> defined within ASCII (i.e. character codes 20 hex to 7E hex) are
>> defined identically in both.
>>
>> b) All printable ASCII characters are defined identically in both
>> ASCII and ISO-8859-1.
>>
>> 6) All character codes defined in ASCII (00 hex to 7E hex) are
>> defined identically in Unicode UTF-8.
>>
>> 7) All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex
>> - FF hex) are defined identically in UTF-8.
>>
>> 8) => some Unicode codepoints map to both ASCII and ISO-8859-1.
>> all ASCII maps 1:1 to Unicode UTF-8
>> all ISO-8859-1 maps 1:1 to Unicode UTF-8
>>
>> 9) All ByteString elements which are either a valid ISO-8859-1
>> character or a valid ASCII character are *also* a valid UTF-8
>> character.
>>
>> 10) ISO-8859-1 characters representing a character with a diacritic,
>> or a two-character ligature, have no ASCII equivalent. In Unicode
>> UTF-8, those character codes which represent compound glyphs
>> are called "compatibility codepoints".
>>
>> 11) The preferred Unicode representation of the characters which have
>> compatibility codepoints is as a short set of codepoints
>> representing the characters which are combined together to form the
>> glyph of the convenience codepoint, as a sequence of bytes
>> representing the component characters.
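Point 11 is the precomposed-vs-decomposed distinction that the Unicode standard formalises as NFC/NFD normalization. A small Python sketch using the stdlib `unicodedata` module illustrates it; one detail worth flagging is that the combining mark in the canonical decomposition is U+0327 COMBINING CEDILLA, not the spacing cedilla U+00B8:

```python
import unicodedata

precomposed = "\u00c7"  # Ç - LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = unicodedata.normalize("NFD", precomposed)

# Canonical decomposition: C (U+0043) followed by COMBINING CEDILLA (U+0327).
assert [ord(c) for c in decomposed] == [0x0043, 0x0327]

# NFC recomposes it, so the two forms round-trip.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```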
>>
>>
>> 12) Some concrete examples:
>>
>> A - aka Upper Case A
>> In ASCII, in ISO 8859-1
>> ASCII A - 41 hex
>> ISO-8859-1 A - 41 hex
>> UTF-8 A - 41 hex
>>
>> BEL (a bell sound, often invoked by a Ctrl-g keyboard chord)
>> In ASCII, not in ISO 8859-1
>> ASCII : BEL - 07 hex
>> ISO-8859-1 : 07 hex is not a valid character code
>> UTF-8 : BEL - 07 hex
>>
>> £ (GBP currency symbol)
>> In ISO-8859-1, not in ASCII
>> ASCII : A3 hex is not a valid ASCII code
>> UTF-8: £ - A3 hex
>> ISO-8859-1: £ - A3 hex
>>
>> Upper Case C cedilla
>> In ISO-8859-1, not in ASCII, in UTF-8 as a compatibility codepoint
>> *and* a composed set of codepoints
>> ASCII : C7 hex is not a valid ASCII character code
>> ISO-8859-1 : Upper Case C cedilla - C7 hex
>> UTF-8 : Upper Case C cedilla (compatibility codepoint) - C7 hex
>> Unicode preferred Upper Case C cedilla (composed set of codepoints)
>> Upper case C 0043 hex (Upper case C)
>> followed by
>> cedilla 00B8 hex (cedilla)
>>
>> 13) For any valid ASCII string *and* for any valid ISO-8859-1 string,
>> aByteString is completely adequate for editing and display.
>>
>> 14) When sorting any valid ASCII string *or* any valid ISO-8859-1
>> string, upper and lower case versions of the same character will be
>> treated differently.
>>
>> 15) When sorting any valid ISO-8859-1 string containing
>> letter+diacritic combination glyphs or ligature combination glyphs,
>> the glyphs in combination will be treated differently to a "plain" glyph
>> of the character,
>> i.e. "C" and "C cedilla" will be treated very differently. "ß" and
>> "fs" will be treated very differently.
>>
>> 16) Different nations have different rules about where diacritic-ed
>> characters and ligature pairs should be placed when in alphabetical
>> order.
>>
>> 17) Some nations even have multiple standards - e.g. surnames
>> beginning either "M superscript-c" or "M superscript-a superscript-c"
>> are treated as beginning equivalently in UK phone directories, but not
>> in other situations.
>>
>>
>> Some practical upshots
>> ==================
>>
>> 1) Cuis and its ISO-8859-1 encoding is *exactly* the same as UTF-8,
>> for any single character it considers valid, or any ByteString it has
>> made up of characters it considers valid.
>>
>> 2) Any ByteString is valid UTF-8 in any of Squeak, Pharo, Cuis or any
>> other Smalltalk with a single byte ByteString following ASCII or
>> ISO-8859-1.
>>
>> 3) Any Smalltalk (or derivative language) using ByteString can
>> immediately consider its ByteString as valid UTF-8, as long as it
>> also considers the ByteString as valid ASCII and/or ISO-8859-1.
>>
>> 4) All of those can be successfully exported to any system using UTF-8
>> (e.g. HTML).
>>
>> 5) To successfully *accept* all UTF-8 we must be able to do either:
>> a) accept UTF-8 strings with composed characters
>> b) convert UTF-8 strings with composed characters into UTF-8 strings
>> that use *only* compatibility codepoints.
>>
>>
>> Class + protocol proposals
>>
>>
>>
>> a Utf8CompatibilityString class.
>>
>> asByteString - ensure only compatibility codepoints are used.
>> Ensure it does not encode characters above 00FF hex.
>>
>> asIso8859String - ensures only compatibility codepoints are used,
>> and that the characters are each valid ISO 8859-1
>>
>> asAsciiString - ensures only characters 00 hex - 7F hex are used.
>>
>> asUtf8ComposedIso8859String - ensures all compatibility codepoints
>> are expanded into small OrderedCollections of codepoints
>>
>> a Utf8ComposedIso8859String class - will provide sortable and
>> comparable UTF8 strings of all ASCII and ISO 8859-1 strings.
>>
>> Then a Utf8SortableCollection class - a collection of
>> Utf8ComposedIso8859String words and phrases.
>>
>> Custom sortBlocks will define the applicable sort order.
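As a sketch of what one such named sortBlock might do, here is a deliberately crude accent-folding comparison key in Python (the function name and the approach are illustrative assumptions only; a real per-locale collation needs the Unicode Collation Algorithm or an ICU binding, and handles far more cases than this):

```python
import unicodedata

def accent_folded_key(s):
    # Decompose to NFD, drop combining marks, then casefold - so that
    # "é" sorts alongside "e" and case is ignored. Real collators also
    # handle ligatures, locale tailoring, multi-level weights, etc.
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c)).casefold()

words = ["zèbre", "Zoe", "école", "Echo"]
assert sorted(words, key=accent_folded_key) == ["Echo", "école", "zèbre", "Zoe"]
```

A naive `sorted(words)` would instead put all the capitalised words first and the accented ones last, by raw code point.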
>>
>> We can create a collection... a Dictionary, thinking about it, of
>> named, prefabricated sortBlocks.
>>
>> This will work for all UTF8 strings of ISO-8859-1 and ASCII strings.
>>
>> If anyone has better names for the classes, please let me know.
>>
>> If anyone else wants to help
>> - build these,
>> - create SUnit tests for these
>> - write documentation for these
>> please let me know.
>>
>> n.b. I have had absolutely no experience of Ropes.
>>
>> My own background with this stuff: in the early 90's I was a Project
>> Manager implementing office automation systems across a global
>> company, with offices in the Americas, Western, Eastern and Central
>> European nations (including Slavic and Cyrillic users), Japan and
>> China. The mission-critical application was word-processing.
>>
>> Our offices were spread around the globe, and we needed those offices
>> to successfully exchange documents with their sister offices, and with
>> the customers in each region the offices were in.
>>
>> Unicode was then new, and our platform supplier was the NeXT
>> Corporation, who had been a founder member of the Unicode Consortium
>> in 1990.
>>
>> So far: I've read the latest version of the Unicode Standard (v8.0).
>> This is freely downloadable.
>> I've purchased a paper copy of an earlier release. New releases
>> typically consist of additional codespaces (i.e. alphabets), so old
>> copies are useful, as well as cheap. (Paper copies of version 4.0
>> are available second-hand for < $10 / €10.)
>>
>> The typical change with each release is the addition of further
>> codespaces (i.e. alphabets, more or less), so you don't lose a lot.
>> (I'll be going through my v4.0 just to make sure.)
>>
>> Cheers,
>> Euan
>>
>>
>> On 5 December 2015 at 13:08, stepharo <[email protected]> wrote:
>>> Hi EuanM
>>>
>>> On 4/12/15 12:42, EuanM wrote:
>>>>
>>>> I'm currently groping my way to seeing how feature-complete our
>>>> Unicode support is. I am doing this to establish what still needs to
>>>> be done to provide full Unicode support.
>>>
>>>
>>> This is great. Thanks for pushing this. I wrote and collected some roadmap
>>> (analyses on different topics)
>>> on the Pharo GitHub project; feel free to add this one there.
>>>>
>>>>
>>>> This seems to me to be an area where it would be best to write it
>>>> once, and then have the same codebase incorporated into the Smalltalks
>>>> that most share a common ancestry.
>>>>
>>>> I am keen to get: equality-testing for strings; sortability for
>>>> strings which have ligatures and diacritic characters; and correct
>>>> round-tripping of data.
>>>
>>> Go!
>>> My suggestion is
>>> start small
>>> make steady progress
>>> write tests
>>> commit often :)
>>>
>>> Stef
>>>
>>> What is the French phone-book ordering? Because this is the first time I hear
>>> about it.
>>>
>>>>
>>>> Call to action:
>>>> ==========
>>>>
>>>> If you have comments on these proposals - such as "but we already have
>>>> that facility" or "the reason we do not have these facilities is
>>>> because they are dog-slow" - please let me know them.
>>>>
>>>> If you would like to help out, please let me know.
>>>>
>>>> If you have Unicode experience and expertise, and would like to be, or
>>>> would be willing to be, in the 'council of experts' for this project,
>>>> please let me know.
>>>>
>>>> If you have comments or ideas on anything mentioned in this email
>>>>
>>>> In the first instance, the initiative's website will be:
>>>> http://smalltalk.uk.to/unicode.html
>>>>
>>>> I have created a SqueakSource.com project called UnicodeSupport
>>>>
>>>> I want to avoid re-inventing any facilities which already exist.
>>>> Except where they prevent us reaching the goals of:
>>>> - sortable UTF8 strings
>>>> - sortable UTF16 strings
>>>> - equivalence testing of 2 UTF8 strings
>>>> - equivalence testing of 2 UTF16 strings
>>>> - round-tripping UTF8 strings through Smalltalk
>>>> - round-tripping UTF16 strings through Smalltalk.
>>>> As I understand it, we have limited Unicode support atm.
>>>>
>>>> Current state of play
>>>> ===============
>>>> ByteString gets converted to WideString when need is automagically
>>>> detected.
>>>>
>>>> Is there anything else that currently exists?
>>>>
>>>> Definition of Terms
>>>> ==============
>>>> A quick definition of terms before I go any further:
>>>>
>>>> Standard terms from the Unicode standard
>>>> ===============================
>>>> a compatibility character : an additional encoding of a *normal*
>>>> character, for compatibility and round-trip conversion purposes. For
>>>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>>>
>>>> Made-up terms
>>>> ============
>>>> a convenience codepoint : a single codepoint which represents an item
>>>> that is also encoded as a string of codepoints.
>>>>
>>>> (I tend to use the terms compatibility character and compatibility
>>>> codepoint interchangeably. The standard only refers to them as
>>>> compatibility characters. However, the standard is determined to
>>>> emphasise that characters are abstract and that codepoints are
>>>> concrete. So I think it is often more useful and productive to think
>>>> of compatibility or convenience codepoints.)
>>>>
>>>> a composed character : a character made up of several codepoints
>>>>
>>>> Unicode encoding explained
>>>> =====================
>>>> A convenience codepoint can therefore be thought of as a code point
>>>> used for a character which also has a composed form.
>>>>
>>>> The way Unicode works is that sometimes you can encode a character in
>>>> one byte, sometimes not. Sometimes you can encode it in two bytes,
>>>> sometimes not.
>>>>
>>>> You can therefore have a long stream of ASCII which is single-byte
>>>> Unicode. If there is an occasional Cyrillic or Greek character in the
>>>> stream, it would be represented either by a compatibility character or
>>>> by a multi-byte combination.
>>>>
>>>> Using compatibility characters can prevent proper sorting and
>>>> equivalence testing.
>>>>
>>>> Using "pure" Unicode, i.e. "normal encodings", can cause compatibility
>>>> and round-tripping problems. Although avoiding them can *also* cause
>>>> compatibility issues and round-tripping problems.
>>>>
>>>> Currently my thinking is:
>>>>
>>>> a Utf8String class
>>>> an OrderedCollection, with 1-byte characters as the modal element,
>>>> but short arrays of wider strings where necessary
>>>> a Utf16String class
>>>> an OrderedCollection, with 2-byte characters as the modal element,
>>>> but short arrays of wider strings,
>>>> beginning with a 2-byte endianness indicator.
>>>>
>>>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>>>> compatible.
>>>>
>>>> So my thinking is that Utf8String will contain convenience codepoints,
>>>> for round-tripping. And where there are multiple convenience
>>>> codepoints for a character, that it standardises on one.
>>>>
>>>> And that there is a Utf8SortableString which uses *only* normal
>>>> characters.
>>>>
>>>> We then need methods to convert between the two.
>>>>
>>>> aUtf8String asUtf8SortableString
>>>>
>>>> and
>>>>
>>>> aUtf8SortableString asUtf8String
>>>>
>>>>
>>>> Sort orders are culture and context dependent - Sweden and Germany
>>>> have different sort orders for the same diacritic-ed characters. Some
>>>> countries have one order in general usage, and another for specific
>>>> usages, such as phone directories (e.g.
>>>> UK and France)
>>>>
>>>> Similarly for Utf16 : Utf16String and Utf16SortableString and
>>>> conversion methods
>>>>
>>>> A list of sorted words would be a SortedCollection, and there could be
>>>> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>>>> seOrder, ukOrder, etc.
>>>>
>>>> along the lines of
>>>> aListOfWords := SortedCollection sortBlock: deOrder
>>>>
>>>> If a word is either a Utf8SortableString, or a well-formed Utf8String,
>>>> then we can perform equivalence testing on them trivially.
>>>>
>>>> To make sure a Utf8String is well formed, we would need to have a way
>>>> of cleaning up any convenience codepoints which were valid, but which
>>>> were for a character which has multiple equally-valid alternative
>>>> convenience codepoints, and for which the string currently had the
>>>> "wrong" convenience codepoint. (i.e. for any character with valid
>>>> alternative convenience codepoints, we would choose one to be in the
>>>> well-formed Utf8String, and we would need a method for cleaning the
>>>> alternative convenience codepoints out of the string, and replacing
>>>> them with the chosen approved convenience codepoint.)
>>>>
>>>> aUtf8String cleanUtf8String
>>>>
>>>> With WideString, a lot of the issues disappear - except
>>>> round-tripping (although I'm sure I have seen something recently about
>>>> 4-byte strings that also have an additional bit, which would make
>>>> some Unicode characters 5 bytes long).
>>>>
>>>>
>>>> (I'm starting to zone out now - if I've overlooked anything - obvious,
>>>> subtle, or somewhere in between, please let me know)
>>>>
>>>> Cheers,
>>>> Euan
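The equivalence-testing goal running through this thread boils down to comparing strings under a common normalization form, which is essentially what the proposed cleanUtf8String / asUtf8SortableString pair would do. A hedged Python sketch of the idea (the function name is made up for illustration, and NFC is chosen arbitrarily as the single "approved" form):

```python
import unicodedata

def utf8_equivalent(a: bytes, b: bytes) -> bool:
    # Decode both UTF-8 byte sequences, then compare under one agreed
    # normalization form (NFC here, by assumption).
    return (unicodedata.normalize("NFC", a.decode("utf-8"))
            == unicodedata.normalize("NFC", b.decode("utf-8")))

precomposed = "\u00e9".encode("utf-8")   # é as one codepoint: bytes C3 A9
combining = "e\u0301".encode("utf-8")    # e + combining acute: bytes 65 CC 81

assert precomposed != combining          # the raw byte sequences differ...
assert utf8_equivalent(precomposed, combining)  # ...but the strings are equivalent
```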
