decomposition in Pharo?)

Sven Van Caekenberghe Mon, 21 Dec 2015 02:19:28 -0800

Andres,

There are no plans at all to drop any of the existing character encodings from 
Pharo. UTF-16 LE & BE will remain part of the standard image, as will all 
single-byte encodings. No need to worry.
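
To make the difference between these encodings concrete, here is a minimal sketch. It is written in Python purely for illustration (it is not Pharo code and says nothing about how Pharo's encoders are implemented); it encodes the single code point U+00E9 with the standard UTF-8, UTF-16 LE and UTF-16 BE codecs.

```python
# Encode U+00E9 (LATIN SMALL LETTER E WITH ACUTE) with three encodings
# mentioned in this thread, using Python's built-in codecs.
text = "\u00e9"

utf8_bytes = text.encode("utf-8")
utf16le_bytes = text.encode("utf-16-le")
utf16be_bytes = text.encode("utf-16-be")

print(utf8_bytes.hex())     # c3a9
print(utf16le_bytes.hex())  # e900
print(utf16be_bytes.hex())  # 00e9
```

The same code point yields three different byte sequences, which is why the encoding has to be chosen per external interface (for example UTF-16 for many Windows APIs).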


Sven

> On 19 Dec 2015, at 03:04, Andres Valloud <[email protected]> 
> wrote:
> 
> So a lot of Windows APIs require UTF-16.  What's up with UTF-8 being the only 
> choice mentioned for external communication?
> 
> Unicode string encodings like UTF-* and strings of "characters" (that is, 
> sequences of Unicode code points) should be clearly distinguished. Do you 
> really mean "UTF-32", or do you mean "UCS-4"?  Even those two are not exactly 
> the same.
> 
> On 12/18/15 5:47 , H. Hirzel wrote:
>> Hello Sven
>> 
>> Thank you for your report about your experimental, proof of
>> concept, prototype project, that aims to improve Unicode support.
>> Please include me in the loop.
>> 
>> Below is my attempt at summarizing the Unicode discussion of the last 
>> weeks.
>> Corrections /comments / additions are welcome.
>> 
>> Kind regards
>> 
>> Hannes
>> 
>> 
>> 1) There is a need for improved Unicode support implemented _within_
>> the image, probably as a library.
>> 
>> 1a) This follows the example of the Twitter CLDR library (i.e.
>> re-implementation of ICU components for Ruby).
>> https://github.com/twitter/twitter-cldr-rb
>> 
>> Other languages/libraries have similar approaches
>> - dotNet, 
>> https://msdn.microsoft.com/en-us/library/System.Globalization.CharUnicodeInfo%28v=vs.110%29.aspx
>> - Python https://docs.python.org/3/howto/unicode.html
>> - Go http://blog.golang.org/strings
>> - Swift, 
>> https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
>> - Perl http://blog.golang.org/strings
>> 
>> 1b) ICU is _not_ the way to go (http://site.icu-project.org/). This
>> is because of security and portability reasons (Eliot Miranda) and
>> because of the Smalltalk approach that wants to expose basic
>> algorithms in Smalltalk code. In addition, the 16-bit based ICU library
>> does not fit well with the Squeak/Pharo UTF-32 model.
>> 
>> 2) The Unicode infrastructure (21(32) bit wide Characters as immediate
>> objects, use of UTF-32 internally, indexable strings, UTF8 for outside
>> communication, support of code converters) is a very valuable
>> foundation which makes algorithms more straightforward at the expense
>> of more memory usage. It is not yet used to its full potential,
>> though a lot of hard work has been done.
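
The trade-off in point 2 (fixed-width storage gives direct indexing but costs memory) can be illustrated with a small sketch. This is Python used only as a stand-in, not Pharo code; `encoded_size` and `code_point_at` are hypothetical helper names made up for this example.

```python
# Compare the storage cost of one string under three encodings, and show
# why a fixed-width encoding makes indexing by code point trivial.
word = "D\u00fcsseldorf"  # 10 code points, one of them non-ASCII

def encoded_size(s, encoding):
    """Number of bytes needed to store s in the given encoding."""
    return len(s.encode(encoding))

def code_point_at(utf32le_bytes, index):
    """Code point number `index` (0-based) lives at a fixed 4-byte offset."""
    return int.from_bytes(utf32le_bytes[4 * index:4 * index + 4], "little")

sizes = {enc: encoded_size(word, enc)
         for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)  # {'utf-8': 11, 'utf-16-le': 20, 'utf-32-le': 40}

second = code_point_at(word.encode("utf-32-le"), 1)
print(hex(second))  # 0xfc, i.e. U+00FC LATIN SMALL LETTER U WITH DIAERESIS
```

With a variable-width encoding such as UTF-8, finding the n-th code point requires scanning from the start; with 32-bit elements it is a single offset computation.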
>> 
>> 3) The Unicode algorithms are mostly table / database driven. This
>> means that dictionary lookup is a prominent part of the algorithms.
>> The essential building block for this is that the Unicode character
>> database UCD  (http://www.unicode.org/ucd/) is made  available
>> _within_ the image with the full content as needed by the target
>> languages / scripts one wants to deal with. The process of loading the
>> UCD should be made configurable.
>> 
>> 3a) a lot of people are interested in the Latin script (and scripts of
>> similar complexity) only.
>> 3b) The UCD data in XML form
>> http://www.unicode.org/Public/8.0.0/ucdxml/  offers a download with
>> and without the CJK characters.
>> 
>> 4) The next step is to implement normalization
>> (http://www.unicode.org/reports/tr15/#Norm_Forms). Glad to read that
>> you have reached results here with the test data:
>> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt.
>> 
>> 5) Pharo offers nice inspectors to view dictionaries and ordered
>> collections (table view, drill down), which facilitates the development
>> of table-driven algorithms. The data structures and algorithms do
>> not depend on a particular dialect though and may be ported to Squeak
>> or Cuis.
>> 
>> 6) After having implemented normalization, comparison may be
>> implemented. This needs CLDR access (collation, Unicode Common Locale
>> Data Repository, http://cldr.unicode.org/ ).
>> 
>> 
>> 7) An architecture has the following subsystems
>> 
>> 7a) Basic character handling (21(32)bit characters in indexable
>> strings, point 2)
>> 7b) Runtime access to the Unicode Character Database (point 3)
>> 7c) Converters
>> 7d) Normalization (point 4)
>> 7e) CLDR access (point 6)
>> 
>> 
>> 8) The implementation should be driven by the current needs.
>> 
>> An attainable next goal is to release
>> 
>> 8a) a StringBuilder utility class for easier construction of test strings
>> i.e. instead of
>> 
>>> normalizer composeString: (#(68 117 776 115 115 101 108 100 111 114 102 32
>>> 75 111 776 110 105 103 115 97 108 108 101 101) collect: #asCharacter as:
>>> String).
>> 
>> do
>> normalizer composeString:
>> (StringBuilder construct: 'Du\u0308sseldorf Ko\u0308nigsallee')
>> 
>> and construct some test cases with it which illustrate some basic
>> Unicode issues.
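
A rough sketch of what the escape expansion proposed in 8a could look like, written in Python as a stand-in for the (not yet existing) Smalltalk StringBuilder. The function name `construct` mirrors the selector suggested above; the four-hex-digit `\uXXXX` syntax is an assumption about what such a utility would accept.

```python
import re
import unicodedata

# Matches a literal backslash, 'u', and exactly four hexadecimal digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

def construct(template):
    """Replace each literal \\uXXXX escape with the code point it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), template)

# Raw string, so the escapes reach construct() unexpanded.
decomposed = construct(r"Du\u0308sseldorf Ko\u0308nigsallee")
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # 24 code points, diaereses still separate
print(len(composed))    # 22 code points after composition
```

Writing test strings this way keeps the decomposed form visible in the source, which a mail client or editor would otherwise silently compose.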
>> 
>> 8b) identity testing for major languages (e.g. French, German,
>> Spanish) and scripts of similar complexity. I
>> 
>> 8c) to provide some more documentation of past and concurrent efforts.
>> 
>> Note: This summary has only covered string manipulation, not rendering
>> on the screen which is a different issue.
>> 
>> 
>> On 12/16/15, Sven Van Caekenberghe <[email protected]> wrote:
>>> Hi Hannes,
>>> 
>>> My detailed comments/answers below, after quoting 2 of your emails:
>>> 
>>>> On 10 Dec 2015, at 22:17, H. Hirzel <[email protected]> wrote:
>>>> 
>>>> Hello Sven
>>>> 
>>>> On 12/9/15, Sven Van Caekenberghe <[email protected]> wrote:
>>>> 
>>>>> The simplest example in a common language (the French letter é) is
>>>>> 
>>>>> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>>>>> 
>>>>> which can also be written as
>>>>> 
>>>>> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT
>>>>> [U+0301]
>>>>> 
>>>>> The former being a composed normal form, the latter a decomposed normal
>>>>> form. (And yes, it is even much more complicated than that, it goes on
>>>>> for 1000s of pages).
>>>>> 
>>>>> In the above example, the concept of character/string is indeed fuzzy.
>>>>> 
>>>>> HTH,
>>>>> 
>>>>> Sven
>>>> 
>>>> Thanks for this example. I have created a wiki page with it
>>>> 
>>>> I wonder what the Pharo equivalent is of the following Squeak expression
>>>> 
>>>>   $é asString asDecomposedUnicode
>>>> 
>>>> Regards
>>>> 
>>>> Hannes
>>> 
>>> You also wrote:
>>> 
>>>> The text below shows how to deal with the  Unicode e acute example
>>>> brought up by Sven in terms of comparing strings. Currently Pharo and
>>>> Cuis do not do Normalization of strings. Limited support is in Squeak.
>>>> It will be shown how NFD normalization may be implemented.
>>>> 
>>>> 
>>>> Swift programming language
>>>> -----------------------------------------
>>>> 
>>>> How does the Swift programming language [1] deal with Unicode strings?
>>>> 
>>>>   // "Voulez-vous un café?" using LATIN SMALL LETTER E WITH ACUTE
>>>>   let eAcuteQuestion = "Voulez-vous un caf\u{E9}?"
>>>> 
>>>>   // "Voulez-vous un café?" using LATIN SMALL LETTER E and
>>>>   // COMBINING ACUTE ACCENT
>>>>   let combinedEAcuteQuestion = "Voulez-vous un caf\u{65}\u{301}?"
>>>> 
>>>>   if eAcuteQuestion == combinedEAcuteQuestion {
>>>>       print("These two strings are considered equal")
>>>>   }
>>>>   // prints "These two strings are considered equal"
>>>> 
>>>> The equality operator uses the NFD (Normalization Form Decomposed)[2]
>>>> form for the comparison, applying a method
>>>> #decomposedStringWithCanonicalMapping[3]
>>>> 
>>>> 
>>>> Squeak / Pharo
>>>> -----------------------
>>>> 
>>>> Comparison without NFD [3]
>>>> 
>>>> 
>>>> "Voulez-vous un café?"
>>>> eAcuteQuestion  := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>>>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter
>>>> asString, '?'.
>>>> 
>>>> 
>>>> eAcuteQuestion = combinedEAcuteQuestion
>>>> false
>>>> 
>>>> eAcuteQuestion == combinedEAcuteQuestion
>>>> false
>>>> 
>>>> The result is false. A Unicode conformant application however should
>>>> return *true*.
>>>> 
>>>> The reason for this is that Squeak / Pharo strings are not put into NFD
>>>> before testing for equality with =.
>>>> 
>>>> 
>>>> Squeak Unicode strings may be tested for Unicode conformant equality
>>>> by converting them to NFD before testing.
>>>> 
>>>> 
>>>> 
>>>> Squeak using NFD
>>>> 
>>>> asDecomposedUnicode[4] transforms a string into NFD for cases where a
>>>> Unicode code point, if decomposed, is decomposed to only two code
>>>> points [5]. This limitation is imposed by the code which reads
>>>> UnicodeData.txt [7][8] when initializing [6] the Unicode Character
>>>> Database in Squeak. It is not a necessary limitation. The code may be
>>>> rewritten at the price of a more complex implementation of
>>>> #asDecomposedUnicode.
>>>> 
>>>> "Voulez-vous un café?"
>>>> eAcuteQuestion  := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>>>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter
>>>> asString, '?'.
>>>> 
>>>> 
>>>> eAcuteQuestion asDecomposedUnicode =
>>>>   combinedEAcuteQuestion  asDecomposedUnicode
>>>> 
>>>> true
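
The same comparison can be reproduced outside Smalltalk. The sketch below uses Python's standard `unicodedata` module as a stand-in for a Smalltalk normalizer; `canonically_equal` is a hypothetical helper name chosen for this example.

```python
import unicodedata

# Precomposed form: U+00E9
e_acute_question = "Voulez-vous un caf\u00e9?"
# Decomposed form: U+0065 followed by U+0301
combined_e_acute_question = "Voulez-vous un cafe\u0301?"

def canonically_equal(a, b):
    """Compare two strings after putting both into NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Plain code-point comparison sees two different sequences.
print(e_acute_question == combined_e_acute_question)                   # False
# Comparison after normalization treats them as the same text.
print(canonically_equal(e_acute_question, combined_e_acute_question))  # True
```

Normalizing both operands to NFC instead of NFD would give the same answer here; what matters is that both sides use the same form.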
>>>> 
>>>> 
>>>> 
>>>> Conclusion
>>>> ------------------
>>>> 
>>>> Implementing a method like #decomposedStringWithCanonicalMapping
>>>> (Swift) which puts a string into NFD (Normalization Form D) is an
>>>> important building block towards better Unicode compliance. A Squeak
>>>> proposal is given by [4]. It needs to be reviewed and extended.
>>>> 
>>>> It should probably be extended for cases where there are more than
>>>> two code points in the decomposed form (3 or more?)
>>>> 
>>>> Implementing NFD comparison gives us an equality test for
>>>> comparatively small effort in simple cases, covering a large number of
>>>> use cases (languages using the Latin script).
>>>> 
>>>> The algorithm is table driven by the UCD [8]. From this follows a
>>>> simple but important fact: conformant implementations need runtime
>>>> access to information from the Unicode Character Database [UCD][9].
>>>> 
>>>> 
>>>> [1]
>>>> https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html#//apple_ref/doc/uid/TP40014097-CH7-ID285
>>>> [2] http://www.unicode.org/glossary/#normalization_form_d
>>>> [3]
>>>> https://developer.apple.com/library/ios/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/index.html#//apple_ref/occ/instm/NSString/decomposedStringWithCanonicalMapping
>>>> [4] String asDecomposedUnicode http://wiki.squeak.org/squeak/6250
>>>> [5] http://www.unicode.org/glossary/#code_point
>>>> [6] Unicode initialize http://wiki.squeak.org/squeak/6248
>>>> [7] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>>>> [8] Unicode Character Database documentation
>>>> http://unicode.org/reports/tr44/
>>>> [9] http://www.unicode.org/reports/tr23/
>>> 
>>> 
>>> Today, we have a Unicode and CombinedCharacter class in Pharo, and there is
>>> different but similar Unicode code in Squeak. These are too simple (even
>>> though they might work, partially).
>>> 
>>> The scope of the original threads is way too wide: a new string type,
>>> normalisation, collation, being cross dialect, mixing all kinds of character
>>> and encoding definitions. All interesting, but not much will come out of it.
>>> But the point that we cannot leave proper text string handling to an outside
>>> library is indeed key.
>>> 
>>> That is why a couple of people in the Pharo community (myself included)
>>> started an experimental, proof of concept, prototype project, that aims to
>>> improve Unicode support. We will announce it to a wider public when we feel
>>> we have something to show. The goal is in the first place to understand
>>> and implement the fundamental algorithms, starting with the 4 forms of
>>> Normalisation. But we're working on collation/sorting too.
>>> 
>>> This work is of course being done for/in Pharo, using some of the facilities
>>> only available there. It probably won't be difficult to port, but we can't
>>> be bothered with portability right now.
>>> 
>>> What we started with is loading UCD data and making it available as nice
>>> objects (30.000 of them).
>>> 
>>> So now you can do things like
>>> 
>>> $é unicodeCharacterData.
>>> 
>>> => "U+00E9 LATIN SMALL LETTER E WITH ACUTE (LATIN SMALL LETTER E ACUTE)"
>>> 
>>> $é unicodeCharacterData uppercase asCharacter.
>>> 
>>> => "$É"
>>> 
>>> $é unicodeCharacterData decompositionMapping.
>>> 
>>> => "#(101 769)"
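
For comparison, the same three lookups can be done against the UCD tables that ship with Python. This is only an analogy using the standard `unicodedata` module; it does not show how the Pharo `unicodeCharacterData` objects are built.

```python
import unicodedata

ch = "\u00e9"

# Character name from the UCD.
name = unicodedata.name(ch)

# Uppercase mapping.
upper = ch.upper()

# Decomposition mapping: unicodedata returns hex code points as text
# (compatibility mappings also carry a tag such as '<compat>', skipped here).
mapping = [int(field, 16)
           for field in unicodedata.decomposition(ch).split()
           if not field.startswith("<")]

print(name)     # LATIN SMALL LETTER E WITH ACUTE
print(upper)    # É
print(mapping)  # [101, 769]
```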
>>> 
>>> There is also a cool GT Inspector view:
>>> 
>>> 
>>> 
>>> Next we started implementing a normaliser. It was rather easy to get support
>>> for simpler languages going. The next code snippets use explicit code
>>> arrays, because copying decomposed diacritics to my mail client does not
>>> work (they get automatically composed). In a Pharo Workspace this does work
>>> nicely with plain strings. The higher numbers are the diacritics.
>>> 
>>> (normalizer decomposeString: 'les élèves Français') collect: #codePoint as:
>>> Array.
>>> 
>>> => "#(108 101 115 32 101 769 108 101 768 118 101 115 32 70 114 97 110 99
>>> 807 97 105 115)"
>>> 
>>> (normalizer decomposeString: 'Düsseldorf Königsallee') collect: #codePoint
>>> as: Array.
>>> 
>>> => "#(68 117 776 115 115 101 108 100 111 114 102 32 75 111 776 110 105 103
>>> 115 97 108 108 101 101)"
>>> 
>>> normalizer composeString: (#(108 101 115 32 101 769 108 101 768 118 101 115
>>> 32 70 114 97 110 99 807 97 105 115) collect: #asCharacter as: String).
>>> 
>>> => "'les élèves Français'"
>>> 
>>> normalizer composeString: (#(68 117 776 115 115 101 108 100 111 114 102 32
>>> 75 111 776 110 105 103 115 97 108 108 101 101) collect: #asCharacter as:
>>> String).
>>> 
>>> => "'Düsseldorf Königsallee'"
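
The code-point arrays above can be cross-checked independently. The sketch below uses Python's `unicodedata.normalize` as a reference implementation; `decompose_string` and `compose_string` are hypothetical names that merely echo the `decomposeString:` and `composeString:` selectors and assume those correspond to NFD and NFC.

```python
import unicodedata

def decompose_string(s):
    """Canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s)

def compose_string(s):
    """Canonical decomposition followed by canonical composition (NFC)."""
    return unicodedata.normalize("NFC", s)

french = "les \u00e9l\u00e8ves Fran\u00e7ais"
german = "D\u00fcsseldorf K\u00f6nigsallee"

french_points = [ord(c) for c in decompose_string(french)]
german_points = [ord(c) for c in decompose_string(german)]

print(french_points)
print(german_points)

# Composing the decomposed strings gives back the originals.
print(compose_string(decompose_string(french)) == french)  # True
print(compose_string(decompose_string(german)) == german)  # True
```

Both printed lists match the arrays quoted above: 769, 768, 807 and 776 are the combining acute, grave, cedilla and diaeresis.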
>>> 
>>> However, the real algorithm following the official specification (and other
>>> elements of Unicode that interact with it) is way more complicated (think
>>> about all those special languages/scripts out there). We're focused on
>>> understanding/implementing that now.
>>> 
>>> Next, unit tests were added (of course), as well as a test that uses
>>> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt to run about
>>> 75.000 individual test cases to check conformance to the official Unicode
>>> Normalization specification.
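
A sketch of how such a conformance run can be structured. Each data line of NormalizationTest.txt has five semicolon-separated columns of hex code points, and the file header documents which column every normalization form must produce. The sketch is in Python and checks Python's own `unicodedata` rather than the Pharo normalizer; the two sample lines are illustrative lines written in the file's format, not copied from it, and `parse_line`, `check` and `run` are hypothetical names.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def parse_line(line):
    """Return the five test columns as strings, or None for non-data lines."""
    data = line.split("#", 1)[0].strip()   # drop trailing comment
    if not data or data.startswith("@"):   # blank line or part header
        return None
    fields = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in field.split())
            for field in fields]

def check(columns):
    """Apply the invariants documented in the header of the test file."""
    c1, c2, c3, c4, c5 = columns
    norm = unicodedata.normalize
    return {
        "NFC": all(norm("NFC", c) == c2 for c in (c1, c2, c3))
               and all(norm("NFC", c) == c4 for c in (c4, c5)),
        "NFD": all(norm("NFD", c) == c3 for c in (c1, c2, c3))
               and all(norm("NFD", c) == c5 for c in (c4, c5)),
        "NFKC": all(norm("NFKC", c) == c4 for c in columns),
        "NFKD": all(norm("NFKD", c) == c5 for c in columns),
    }

def run(lines):
    """Count, per normalization form, how many data lines pass."""
    passed = dict.fromkeys(FORMS, 0)
    total = 0
    for line in lines:
        columns = parse_line(line)
        if columns is None:
            continue
        total += 1
        for form, ok in check(columns).items():
            passed[form] += ok
    return passed, total

SAMPLE_LINES = [
    "@Part0 # illustrative part header",
    "00E9;00E9;0065 0301;00E9;0065 0301; # e with acute",
    "FB01;FB01;FB01;0066 0069;0066 0069; # fi ligature",
]

passed, total = run(SAMPLE_LINES)
for form in FORMS:
    print(f"{form} {passed[form]}/{total} ({100 * passed[form] / total:.2f}%)")
```

Run against the real file, the results depend on the Unicode version of the implementation under test, so a mismatch between data file version and library version can by itself produce failures.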
>>> 
>>> Right now (with super cool hangul / jamo code by Henrik), we hit the
>>> following stats:
>>> 
>>> #testNFC 16998/18593 (91.42%)
>>> #testNFD 16797/18593 (90.34%)
>>> #testNFKC 13321/18593 (71.65%)
>>> #testNFKD 16564/18593 (89.09%)
>>> 
>>> Way better than the naive implementations, but not yet there.
>>> 
>>> We are also experimenting and thinking a lot about how to best implement all
>>> this, trying out different models/ideas/apis/representations.
>>> 
>>> It will move slowly, but you will hear from us again in the coming
>>> weeks/months.
>>> 
>>> Sven
>>> 
>>> PS: Pharo developers with a good understanding of this subject area that
>>> want to help, let me know and we'll put you in the loop. Hacking and
>>> specification reading are required ;-)
>>> 
>>> 
>> 
>> .
>> 


Re: [Pharo-dev] [squeak-dev] [Unicode] Summary (Re: Unicode Support // e acute example --> decomposition in Pharo?)
