Hi Hannes,

> On 18 Dec 2015, at 14:47, H. Hirzel <[email protected]> wrote:
> 
> Hello Sven
> 
> Thank you for your report about your experimental, proof of
> concept, prototype project, that aims to improve Unicode support.
> Please include me in the loop.
> 
> Below is my attempt at summarizing the Unicode discussion of the last
> weeks.

Excellent summary.

> Corrections /comments / additions are welcome.

Not really, it is pretty accurate.

We are working on all parts of your architecture (point 7 below), with a focus 
on correct algorithms, not new representations of strings (the current ones are 
mostly fine).

I can tell you that right now, we already pass 100% of the tests that use 
http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt to run about 75.000 
individual test cases to check conformance to the official Unicode 
Normalization specification. So, NFD, NFKD, NFC and NFKC are done now, in 
principle.
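
For illustration only, and not the Pharo implementation: the conformance check that NormalizationTest.txt asks for can be sketched in Python with the standard unicodedata module. Each data line holds five columns c1..c5, and the file header lists the invariants each normalization form must satisfy. The helper names parse_line and check_line are made up for this sketch, it handles data lines only (not the @Part headers or comment lines), and Python's Unicode tables may be at a different version than the latest test file.

```python
import unicodedata


def parse_line(line):
    """Turn one data line of NormalizationTest.txt into its five columns c1..c5."""
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    fields = data.split(";")[:5]           # layout is c1;c2;c3;c4;c5;
    return [
        "".join(chr(int(cp, 16)) for cp in field.split())
        for field in fields
    ]


def check_line(line):
    """Report, per normalization form, whether the invariants hold for this line."""
    c1, c2, c3, c4, c5 = parse_line(line)
    norm = unicodedata.normalize
    return {
        # c2 == NFC(c1) == NFC(c2) == NFC(c3) and c4 == NFC(c4) == NFC(c5)
        "NFC": all(norm("NFC", c) == c2 for c in (c1, c2, c3))
        and all(norm("NFC", c) == c4 for c in (c4, c5)),
        # c3 == NFD(c1) == NFD(c2) == NFD(c3) and c5 == NFD(c4) == NFD(c5)
        "NFD": all(norm("NFD", c) == c3 for c in (c1, c2, c3))
        and all(norm("NFD", c) == c5 for c in (c4, c5)),
        # c4 == NFKC(c) for every column
        "NFKC": all(norm("NFKC", c) == c4 for c in (c1, c2, c3, c4, c5)),
        # c5 == NFKD(c) for every column
        "NFKD": all(norm("NFKD", c) == c5 for c in (c1, c2, c3, c4, c5)),
    }


# A data line in the file's format: LATIN CAPITAL LETTER D WITH DOT ABOVE
sample = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE"
print(check_line(sample))
```

Running every data line through such a check, once per form, is where the roughly 4 x 18593 individual test cases come from.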

I am reasonably sure that within a week or two, after we clean up and document 
everything a bit, we'll be able to announce and show the first results.

Happy Holidays !

Sven

> Kind regards
> 
> Hannes
> 
> 
> 1) There is a need for improved Unicode support implemented _within_
> the image, probably as a library.
> 
> 1a) This follows the example of the Twitter CLDR library (i.e.
> re-implementation of ICU components for Ruby).
> https://github.com/twitter/twitter-cldr-rb
> 
> Other languages/libraries have similar approaches
> - dotNet, 
> https://msdn.microsoft.com/en-us/library/System.Globalization.CharUnicodeInfo%28v=vs.110%29.aspx
> - Python https://docs.python.org/3/howto/unicode.html
> - Go http://blog.golang.org/strings
> - Swift, 
> https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
> - Perl http://blog.golang.org/strings
> 
> 1b) ICU is _not_ the way to go (http://site.icu-project.org/). This
> is because of security and portability reasons (Eliot Miranda) and
> because of the Smalltalk approach that wants to expose basic
> algorithms in Smalltalk code. In addition the 16-bit based ICU library
> does not fit well with the Squeak/Pharo UTF-32 model.
> 
> 2) The Unicode infrastructure (21(32) bit wide Characters as immediate
> objects, use of UTF-32 internally, indexable strings, UTF8 for outside
> communication, support of code converters) is a very valuable
> foundation which makes algorithms more straightforward at the expense
> of more memory usage. It is not used to its full potential at all
> currently, though a lot of hard work has been done.
> 
> 3) The Unicode algorithms are mostly table / database driven. This
> means that dictionary lookup is a prominent part of the algorithms.
> The essential building block for this is that the Unicode character
> database UCD (http://www.unicode.org/ucd/) is made available
> _within_ the image with the full content as needed by the target
> languages / scripts one wants to deal with. The process of loading the
> UCD should be made configurable.
> 
> 3a) A lot of people are interested in the Latin script (and scripts of
> similar complexity) only.
> 3b) The UCD data in XML form
> http://www.unicode.org/Public/8.0.0/ucdxml/ offers a download with
> and without the CJK characters.
> 
> 4) The next step is to implement normalization
> (http://www.unicode.org/reports/tr15/#Norm_Forms). Glad to read that
> you have reached results here with the test data:
> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt.
> 
> 5) Pharo offers nice inspectors to view dictionaries and ordered
> collections (table view, drill down) which facilitates the development
> of table-driven algorithms. The data structures and algorithms do
> not depend on a particular dialect though and may be ported to Squeak
> or Cuis.
> 
> 6) After having implemented normalization, comparison may be
> implemented. This needs CLDR access (collation, Unicode Common Locale
> Data Repository, http://cldr.unicode.org/ ).
> 
> 
> 7) An architecture has the following subsystems
> 
> 7a) Basic character handling (21(32)bit characters in indexable
> strings, point 2)
> 7b) Runtime access to the Unicode Character Database (point 3)
> 7c) Converters
> 7d) Normalization (point 4)
> 7e) CLDR access (point 6)
> 
> 
> 8) The implementation should be driven by the current needs.
> 
> An attainable next goal is to release
> 
> 8a) a StringBuilder utility class for easier construction of test strings
> i.e. instead of
> 
>> normalizer composeString: (#(68 117 776 115 115 101 108 100 111 114 102 32
>> 75 111 776 110 105 103 115 97 108 108 101 101) collect: #asCharacter as:
>> String).
> 
> do
> normalizer composeString:
> (StringBuilder construct: 'Du\u0308sseldorf Ko\u0308nigsallee')
> 
> and construct some test cases with it which illustrate some basic
> Unicode issues.
> 
> 8b) identity testing for major languages (e.g. French, German,
> Spanish) and scripts of similar complexity. I
> 
> 8c) to provide some more documentation of past and concurrent efforts.
> 
> Note: This summary has only covered string manipulation, not rendering
> on the screen which is a different issue.
> 
> 
> On 12/16/15, Sven Van Caekenberghe <[email protected]> wrote:
>> Hi Hannes,
>> 
>> My detailed comments/answers below, after quoting 2 of your emails:
>> 
>>> On 10 Dec 2015, at 22:17, H. Hirzel <[email protected]> wrote:
>>> 
>>> Hello Sven
>>> 
>>> On 12/9/15, Sven Van Caekenberghe <[email protected]> wrote:
>>> 
>>>> The simplest example in a common language (the French letter é) is
>>>> 
>>>> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>>>> 
>>>> which can also be written as
>>>> 
>>>> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT
>>>> [U+0301]
>>>> 
>>>> The former being a composed normal form, the latter a decomposed normal
>>>> form. (And yes, it is even much more complicated than that, it goes on
>>>> for 1000s of pages).
>>>> 
>>>> In the above example, the concept of character/string is indeed fuzzy.
>>>> 
>>>> HTH,
>>>> 
>>>> Sven
>>> 
>>> Thanks for this example. I have created a wiki page with it
>>> 
>>> I wonder what the Pharo equivalent is of the following Squeak expression
>>> 
>>>  $é asString asDecomposedUnicode
>>> 
>>> Regards
>>> 
>>> Hannes
>> 
>> You also wrote:
>> 
>>> The text below shows how to deal with the Unicode e acute example
>>> brought up by Sven in terms of comparing strings. Currently Pharo and
>>> Cuis do not do Normalization of strings. Limited support is in Squeak.
>>> It will be shown how NFD normalization may be implemented.
>>> 
>>> 
>>> Swift programming language
>>> -----------------------------------------
>>> 
>>> How does the Swift programming language [1] deal with Unicode strings?
>>> 
>>> // "Voulez-vous un café?" using LATIN SMALL LETTER E WITH ACUTE
>>>  let eAcuteQuestion = "Voulez-vous un caf\u{E9}?"
>>> 
>>>  // "Voulez-vous un café?" using LATIN SMALL LETTER E and
>>> COMBINING ACUTE ACCENT
>>>  let combinedEAcuteQuestion = "Voulez-vous un caf\u{65}\u{301}?"
>>> 
>>>  if eAcuteQuestion == combinedEAcuteQuestion {
>>>  print("These two strings are considered equal")
>>>  }
>>>  // prints "These two strings are considered equal"
>>> 
>>> The equality operator uses the NFD (Normalization Form Decomposed)[2]
>>> form for the comparison, applying a method
>>> #decomposedStringWithCanonicalMapping[3]
>>> 
>>> 
>>> Squeak / Pharo
>>> -----------------------
>>> 
>>> Comparison without NFD [3]
>>> 
>>> 
>>> "Voulez-vous un café?"
>>> eAcuteQuestion  := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter
>>> asString, '?'.
>>> 
>>> 
>>> eAcuteQuestion = combinedEAcuteQuestion
>>> false
>>> 
>>> eAcuteQuestion == combinedEAcuteQuestion
>>> false
>>> 
>>> The result is false. A Unicode conformant application however should
>>> return *true*.
>>> 
>>> The reason for this is that Squeak / Pharo strings are not put into NFD
>>> before testing for equality with =.
>>> 
>>> 
>>> Squeak Unicode strings may be tested for Unicode conformant equality
>>> by converting them to NFD before testing.
>>> 
>>> 
>>> 
>>> Squeak using NFD
>>> 
>>> asDecomposedUnicode[4] transforms a string into NFD for cases where a
>>> Unicode code point, if decomposed, is decomposed into only two code
>>> points [5]. This limitation is imposed by the code which reads
>>> UnicodeData.txt [7][8] when initializing [6] the Unicode Character
>>> Database in Squeak. It is not a necessary limitation. The code may
>>> be rewritten at the price of a more complex
>>> implementation of #asDecomposedUnicode.
>>> 
>>> "Voulez-vous un café?"
>>> eAcuteQuestion  := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter
>>> asString, '?'.
>>> 
>>> 
>>> eAcuteQuestion asDecomposedUnicode =
>>>  combinedEAcuteQuestion  asDecomposedUnicode
>>> 
>>> true
>>> 
>>> 
>>> 
>>> Conclusion
>>> ------------------
>>> 
>>> Implementing a method like #decomposedStringWithCanonicalMapping
>>> (swift) which puts a string into NFD (Normalization Form D) is an
>>> important building block towards better Unicode compliance. A Squeak
>>> proposal is given by [4]. It needs to be reviewed and extended.
>>> 
>>> It should probably be extended for cases where there are more than
>>> two code points in the decomposed form (3 or more?)
>>> 
>>> Implementing NFD comparison gives us an equality test for a
>>> comparatively small effort for simple cases covering a large number of
>>> use cases (Languages using the Latin script).
>>> 
>>> The algorithm is table driven by the UCD [8]. From this follows a
>>> simple but important fact: conformant implementations need runtime
>>> access to information from the Unicode Character Database [UCD][9].
>>> 
>>> 
>>> [1]
>>> https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html#//apple_ref/doc/uid/TP40014097-CH7-ID285
>>> [2] http://www.unicode.org/glossary/#normalization_form_d
>>> [3]
>>> https://developer.apple.com/library/ios/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/index.html#//apple_ref/occ/instm/NSString/decomposedStringWithCanonicalMapping
>>> [4] String asDecomposedUnicode http://wiki.squeak.org/squeak/6250
>>> [5] http://www.unicode.org/glossary/#code_point
>>> [6] Unicode initialize http://wiki.squeak.org/squeak/6248
>>> [7] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>>> [8] Unicode Character Database documentation
>>> http://unicode.org/reports/tr44/
>>> [9] http://www.unicode.org/reports/tr23/
>> 
>> 
>> Today, we have a Unicode and CombinedCharacter class in Pharo, and there is
>> different but similar Unicode code in Squeak. These are too simple (even
>> though they might work, partially).
>> 
>> The scope of the original threads is way too wide: a new string type,
>> normalisation, collation, being cross dialect, mixing all kinds of character
>> and encoding definitions. All interesting, but not much will come out of it.
>> But the point that we cannot leave proper text string handling to an outside
>> library is indeed key.
>> 
>> That is why a couple of people in the Pharo community (myself included)
>> started an experimental, proof of concept, prototype project, that aims to
>> improve Unicode support. We will announce it to a wider public when we feel
>> we have something to show for it. The goal is in the first place to understand
>> and implement the fundamental algorithms, starting with the 4 forms of
>> Normalisation. But we're working on collation/sorting too.
>> 
>> This work is of course being done for/in Pharo, using some of the facilities
>> only available there. It probably won't be difficult to port, but we can't
>> be bothered with portability right now.
>> 
>> What we started with is loading UCD data and making it available as nice
>> objects (30.000 of them).
>> 
>> So now you can do things like
>> 
>> $é unicodeCharacterData.
>> 
>> => "U+00E9 LATIN SMALL LETTER E WITH ACUTE (LATIN SMALL LETTER E ACUTE)"
>> 
>> $é unicodeCharacterData uppercase asCharacter.
>> 
>> => "$É"
>> 
>> $é unicodeCharacterData decompositionMapping.
>> 
>> => "#(101 769)"
>> 
>> There is also a cool GT Inspector view:
>> 
>> 
>> 
>> Next we started implementing a normaliser. It was rather easy to get support
>> for simpler languages going. The next code snippets use explicit code
>> arrays, because copying decomposed diacritics to my mail client does not
>> work (they get automatically composed); in a Pharo Workspace this does work
>> nicely with plain strings. The higher numbers are the diacritics.
>> 
>> (normalizer decomposeString: 'les élèves Français') collect: #codePoint as:
>> Array.
>> 
>> => "#(108 101 115 32 101 769 108 101 768 118 101 115 32 70 114 97 110 99
>> 807 97 105 115)"
>> 
>> (normalizer decomposeString: 'Düsseldorf Königsallee') collect: #codePoint
>> as: Array.
>> 
>> => "#(68 117 776 115 115 101 108 100 111 114 102 32 75 111 776 110 105 103
>> 115 97 108 108 101 101)"
>> 
>> normalizer composeString: (#(108 101 115 32 101 769 108 101 768 118 101 115
>> 32 70 114 97 110 99 807 97 105 115) collect: #asCharacter as: String).
>> 
>> => "'les élèves Français'"
>> 
>> normalizer composeString: (#(68 117 776 115 115 101 108 100 111 114 102 32
>> 75 111 776 110 105 103 115 97 108 108 101 101) collect: #asCharacter as:
>> String).
>> 
>> => "'Düsseldorf Königsallee'"
>> 
>> However, the real algorithm following the official specification (and other
>> elements of Unicode that interact with it) is way more complicated (think
>> about all those special languages/scripts out there). We're focused on
>> understanding/implementing that now.
>> 
>> Next, unit tests were added (of course). As well as a test that uses
>> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt to run about
>> 75.000 individual test cases to check conformance to the official Unicode
>> Normalization specification.
>> 
>> Right now (with super cool hangul / jamo code by Henrik), we hit the
>> following stats:
>> 
>> #testNFC 16998/18593 (91.42%)
>> #testNFD 16797/18593 (90.34%)
>> #testNFKC 13321/18593 (71.65%)
>> #testNFKD 16564/18593 (89.09%)
>> 
>> Way better than the naive implementations, but not yet there.
>> 
>> We are also experimenting and thinking a lot about how to best implement all
>> this, trying out different models/ideas/apis/representations.
>> 
>> It will move slowly, but you will hear from us again in the coming
>> weeks/months.
>> 
>> Sven
>> 
>> PS: Pharo developers with a good understanding of this subject area that
>> want to help, let me know and we'll put you in the loop. Hacking and
>> specification reading are required ;-)
>> 
>> 
