Hi Hannes,

> On 18 Dec 2015, at 14:47, H. Hirzel <[email protected]> wrote:
>
> Hello Sven
>
> Thank you for your report about your experimental, proof of
> concept, prototype project, that aims to improve Unicode support.
> Please include me in the loop.
>
> Below is my attempt at summarizing the Unicode discussion of the last
> weeks.

Excellent summary.

> Corrections /comments / additions are welcome.

Not really, it is pretty accurate.

We are working on all parts of your architecture (point 7 below), with a focus on correct algorithms, not new representations of strings (the current ones are mostly fine).

I can tell you that right now, we are already at 100% of the tests that use http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt to run about 75.000 individual test cases to check conformance to the official Unicode Normalization specification. So, NFD, NFKD, NFC and NFKC are done now, in principle.

I am reasonably sure that within a week or two, after we clean up and document everything a bit, we'll be able to announce and show the first results.

Happy Holidays !

Sven

> Kind regards
>
> Hannes
>
>
> 1) There is a need for improved Unicode support implemented _within_
> the image, probably as a library.
>
> 1a) This follows the example of the Twitter CLDR library (i.e.
> re-implementation of ICU components for Ruby).
> https://github.com/twitter/twitter-cldr-rb
>
> Other languages/libraries have similar approaches
> - dotNet,
> https://msdn.microsoft.com/en-us/library/System.Globalization.CharUnicodeInfo%28v=vs.110%29.aspx
> - Python https://docs.python.org/3/howto/unicode.html
> - Go http://blog.golang.org/strings
> - Swift,
> https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
> - Perl http://blog.golang.org/strings
>
> 1b) ICU is _not_ the way to go (http://site.icu-project.org/). This
> is because of security and portability reasons (Eliot Miranda) and
> because of the Smalltalk approach that wants to expose basic
> algorithms in Smalltalk code. In addition the 16bit based ICU library
> does not fit well with the Squeak/Pharo UTF32 model.
>
> 2) The Unicode infrastructure (21(32) bit wide Characters as immediate
> objects, use of UTF-32 internally, indexable strings, UTF8 for outside
> communication, support of code converters) is a very valuable
> foundation which makes algorithms more straightforward at the expense
> of more memory usage. It is not used to its full potential at all
> currently though a lot of hard work has been done.
>
> 3) The Unicode algorithms are mostly table / database driven. This
> means that dictionary lookup is a prominent part of the algorithms.
> The essential building block for this is that the Unicode character
> database UCD (http://www.unicode.org/ucd/) is made available
> _within_ the image with the full content as needed by the target
> languages / scripts one wants to deal with. The process of loading the
> UCD should be made configurable.
>
> 3a) a lot of people are interested in the Latin script (and scripts of
> similar complexity) only.
> 3b) The UCD data in XML form
> http://www.unicode.org/Public/8.0.0/ucdxml/ offers a download with
> and without the CJK characters.
>
> 4) The next step is to implement normalization
> (http://www.unicode.org/reports/tr15/#Norm_Forms). Glad to read that
> you have reached results here with the test data:
> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt.
>
> 5) Pharo offers nice inspectors to view dictionaries and ordered
> collections (table view, drill down) which facilitates the development
> of table driven algorithms. The data structures and algorithms do
> not depend on a particular dialect though and may be ported to Squeak
> or Cuis.
>
> 6) After having implemented normalization, comparison may be
> implemented. This needs CLDR access (collation, Unicode Common Locale
> Data Repository, http://cldr.unicode.org/ ).
>
>
> 7) An architecture has the following subsystems
>
> 7a) Basic character handling (21(32)bit characters in indexable
> strings, point 2)
> 7b) Runtime access to the Unicode Character Database (point 3)
> 7c) Converters
> 7d) Normalization (point 4)
> 7e) CLDR access (point 6)
>
>
> 8) The implementation should be driven by the current needs.
>
> An attainable next goal is to release
>
> 8a) a StringBuilder utility class for easier construction of test strings
> i.e. instead of
>
>> normalizer composeString: (#(68 117 776 115 115 101 108 100 111 114 102 32
>> 75 111 776 110 105 103 115 97 108 108 101 101) collect: #asCharacter as:
>> String).
>
> do
> normalizer composeString:
> (StringBuilder construct: 'Du\u0308sseldorf Ko\u0308nigsallee')
>
> and construct some test cases with it which illustrate some basic
> Unicode issues.
>
> 8b) identity testing for major languages (e.g. French, German,
> Spanish) and scripts of similar complexity. I
>
> 8c) to provide some more documentation of past and concurrent efforts.
>
> Note: This summary has only covered string manipulation, not rendering
> on the screen which is a different issue.
>
>
> On 12/16/15, Sven Van Caekenberghe <[email protected]> wrote:
>> Hi Hannes,
>>
>> My detailed comments/answers below, after quoting 2 of your emails:
>>
>>> On 10 Dec 2015, at 22:17, H. Hirzel <[email protected]> wrote:
>>>
>>> Hello Sven
>>>
>>> On 12/9/15, Sven Van Caekenberghe <[email protected]> wrote:
>>>
>>>> The simplest example in a common language (the French letter é) is
>>>>
>>>> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>>>>
>>>> which can also be written as
>>>>
>>>> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT
>>>> [U+0301]
>>>>
>>>> The former being a composed normal form, the latter a decomposed normal
>>>> form. (And yes, it is even much more complicated than that, it goes on
>>>> for 1000s of pages).
>>>>
>>>> In the above example, the concept of character/string is indeed fuzzy.
>>>>
>>>> HTH,
>>>>
>>>> Sven
>>>
>>> Thanks for this example. I have created a wiki page with it
>>>
>>> I wonder what the Pharo equivalent is of the following Squeak expression
>>>
>>> $é asString asDecomposedUnicode
>>>
>>> Regards
>>>
>>> Hannes
>>
>> You also wrote:
>>
>>> The text below shows how to deal with the Unicode e acute example
>>> brought up by Sven in terms of comparing strings. Currently Pharo and
>>> Cuis do not do Normalization of strings. Limited support is in Squeak.
>>> It will be shown how NFD normalization may be implemented.
>>>
>>>
>>> Swift programming language
>>> -----------------------------------------
>>>
>>> How does the Swift programming language [1] deal with Unicode strings?
>>>
>>> // "Voulez-vous un café?" using LATIN SMALL LETTER E WITH ACUTE
>>> let eAcuteQuestion = "Voulez-vous un caf\u{E9}?"
>>>
>>> // "Voulez-vous un café?" using LATIN SMALL LETTER E and COMBINING ACUTE ACCENT
>>> let combinedEAcuteQuestion = "Voulez-vous un caf\u{65}\u{301}?"
>>>
>>> if eAcuteQuestion == combinedEAcuteQuestion {
>>>     print("These two strings are considered equal")
>>> }
>>> // prints "These two strings are considered equal"
>>>
>>> The equality operator uses the NFD (Normalization Form Decomposed)[2]
>>> form for the comparison, applying a method
>>> #decomposedStringWithCanonicalMapping[3]
>>>
>>>
>>> Squeak / Pharo
>>> -----------------------
>>>
>>> Comparison without NFD [3]
>>>
>>>
>>> "Voulez-vous un café?"
>>> eAcuteQuestion := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter asString, '?'.
>>>
>>>
>>> eAcuteQuestion = combinedEAcuteQuestion
>>> false
>>>
>>> eAcuteQuestion == combinedEAcuteQuestion
>>> false
>>>
>>> The result is false. A Unicode conformant application however should
>>> return *true*.
>>>
>>> Reason for this is that Squeak / Pharo strings are not put into NFD
>>> before testing for equality =
>>>
>>>
>>> Squeak Unicode strings may be tested for Unicode conformant equality
>>> by converting them to NFD before testing.
>>>
>>>
>>>
>>> Squeak using NFD
>>>
>>> asDecomposedUnicode[4] transforms a string into NFD for cases where a
>>> Unicode code point, if decomposed, is decomposed only to two code
>>> points [5]. This is so because when initializing [6] the Unicode
>>> Character Database in Squeak this is a limitation imposed by the code
>>> which reads UnicodeData.txt [7][8]. This is not a necessary
>>> limitation. The code may be rewritten at the price of a more complex
>>> implementation of #asDecomposedUnicode.
>>>
>>> "Voulez-vous un café?"
>>> eAcuteQuestion := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter asString, '?'.
>>>
>>>
>>> eAcuteQuestion asDecomposedUnicode =
>>> combinedEAcuteQuestion asDecomposedUnicode
>>>
>>> true
>>>
>>>
>>>
>>> Conclusion
>>> ------------------
>>>
>>> Implementing a method like #decomposedStringWithCanonicalMapping
>>> (Swift) which puts a string into NFD (Normalization Form D) is an
>>> important building block towards better Unicode compliance. A Squeak
>>> proposal is given by [4]. It needs to be reviewed and extended.
>>>
>>> It should probably be extended for cases where there are more than
>>> two code points in the decomposed form (3 or more?)
>>>
>>> Implementing NFD comparison gives us an equality test for a
>>> comparatively small effort for simple cases covering a large number of
>>> use cases (languages using the Latin script).
>>>
>>> The algorithm is table driven by the UCD [8]. From this follows a
>>> simple but important fact: conformant implementations need runtime
>>> access to information from the Unicode Character Database [UCD][9].
>>>
>>>
>>> [1]
>>> https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html#//apple_ref/doc/uid/TP40014097-CH7-ID285
>>> [2] http://www.unicode.org/glossary/#normalization_form_d
>>> [3]
>>> https://developer.apple.com/library/ios/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/index.html#//apple_ref/occ/instm/NSString/decomposedStringWithCanonicalMapping
>>> [4] String asDecomposedUnicode http://wiki.squeak.org/squeak/6250
>>> [5] http://www.unicode.org/glossary/#code_point
>>> [6] Unicode initialize http://wiki.squeak.org/squeak/6248
>>> [7] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>>> [8] Unicode Character Database documentation
>>> http://unicode.org/reports/tr44/
>>> [9] http://www.unicode.org/reports/tr23/
>>
>>
>> Today, we have a Unicode and CombinedCharacter class in Pharo, and there is
>> different but similar Unicode code in Squeak. These are too simple (even
>> though they might work, partially).
>>
>> The scope of the original threads is way too wide: a new string type,
>> normalisation, collation, being cross dialect, mixing all kinds of character
>> and encoding definitions. All interesting, but not much will come out of it.
>> But the point that we cannot leave proper text string handling to an outside
>> library is indeed key.
>>
>> That is why a couple of people in the Pharo community (myself included)
>> started an experimental, proof of concept, prototype project, that aims to
>> improve Unicode support. We will announce it to a wider public when we feel
>> we have something to show for. The goal is in the first place to understand
>> and implement the fundamental algorithms, starting with the 4 forms of
>> Normalisation. But we're working on collation/sorting too.
>>
>> This work is of course being done for/in Pharo, using some of the facilities
>> only available there.
>> It probably won't be difficult to port, but we can't
>> be bothered with portability right now.
>>
>> What we started with is loading UCD data and making it available as nice
>> objects (30.000 of them).
>>
>> So now you can do things like
>>
>> $é unicodeCharacterData.
>>
>> => "U+00E9 LATIN SMALL LETTER E WITH ACUTE (LATIN SMALL LETTER E ACUTE)"
>>
>> $é unicodeCharacterData uppercase asCharacter.
>>
>> => "$É"
>>
>> $é unicodeCharacterData decompositionMapping.
>>
>> => "#(101 769)"
>>
>> There is also a cool GT Inspector view:
>>
>>
>>
>> Next we started implementing a normaliser. It was rather easy to get support
>> for simpler languages going. The next code snippets use explicit code
>> arrays, because copying decomposed diacritics to my mail client does not
>> work (they get automatically composed); in a Pharo Workspace this does work
>> nicely with plain strings. The higher numbers are the diacritics.
>>
>> (normalizer decomposeString: 'les élèves Français') collect: #codePoint as:
>> Array.
>>
>> => "#(108 101 115 32 101 769 108 101 768 118 101 115 32 70 114 97 110 99
>> 807 97 105 115)"
>>
>> (normalizer decomposeString: 'Düsseldorf Königsallee') collect: #codePoint
>> as: Array.
>>
>> => "#(68 117 776 115 115 101 108 100 111 114 102 32 75 111 776 110 105 103
>> 115 97 108 108 101 101)"
>>
>> normalizer composeString: (#(108 101 115 32 101 769 108 101 768 118 101 115
>> 32 70 114 97 110 99 807 97 105 115) collect: #asCharacter as: String).
>>
>> => "'les élèves Français'"
>>
>> normalizer composeString: (#(68 117 776 115 115 101 108 100 111 114 102 32
>> 75 111 776 110 105 103 115 97 108 108 101 101) collect: #asCharacter as:
>> String).
>>
>> => "'Düsseldorf Königsallee'"
>>
>> However, the real algorithm following the official specification (and other
>> elements of Unicode that interact with it) is way more complicated (think
>> about all those special languages/scripts out there). We're focused on
>> understanding/implementing that now.
>>
>> Next, unit tests were added (of course). As well as a test that uses
>> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt to run about
>> 75.000 individual test cases to check conformance to the official Unicode
>> Normalization specification.
>>
>> Right now (with super cool hangul / jamo code by Henrik), we hit the
>> following stats:
>>
>> #testNFC 16998/18593 (91.42%)
>> #testNFD 16797/18593 (90.34%)
>> #testNFKC 13321/18593 (71.65%)
>> #testNFKD 16564/18593 (89.09%)
>>
>> Way better than the naive implementations, but not yet there.
>>
>> We are also experimenting and thinking a lot about how to best implement all
>> this, trying out different models/ideas/apis/representations.
>>
>> It will move slowly, but you will hear from us again in the coming
>> weeks/months.
>>
>> Sven
>>
>> PS: Pharo developers with a good understanding of this subject area that
>> want to help, let me know and we'll put you in the loop. Hacking and
>> specification reading are required ;-)
>>
>>
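
Regarding point 8a (building test strings from \uXXXX escapes), here is a minimal workspace sketch of how such an expansion might be done. It is only an illustration, not the proposed StringBuilder class: the block name expand is hypothetical, only \u followed by exactly four hex digits is handled, and well-formed input is assumed. It uses nothing beyond basic stream and collection messages.

```smalltalk
"Hypothetical sketch for point 8a: expand \uXXXX escapes into Characters.
 Assumes well-formed input: every \u is followed by exactly four hex digits."
| expand |
expand := [ :escaped |
	WideString streamContents: [ :out | | in |
		in := escaped readStream.
		[ in atEnd ] whileFalse: [ | char |
			char := in next.
			(char = $\ and: [ in peek = $u ])
				ifTrue: [ | code |
					in next. "skip the u"
					"accumulate the four hex digits into one code point"
					code := (in next: 4)
						inject: 0
						into: [ :sum :digit | sum * 16 + digit asUppercase digitValue ].
					out nextPut: (Character value: code) ]
				ifFalse: [ out nextPut: char ] ] ] ].

(expand value: 'Du\u0308sseldorf Ko\u0308nigsallee') collect: #codePoint as: Array.
```

Since 16r0308 is 776, the last expression should answer the same code point array as the decomposed 'Düsseldorf Königsallee' example quoted above, so the result could be passed straight to normalizer composeString: in a test.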
