Andres,

There are no plans at all to drop any of the existing character encodings from Pharo. UTF-16 LE & BE will remain part of the standard image, as will all single-byte encodings. No need to worry.

Sven

> On 19 Dec 2015, at 03:04, Andres Valloud <[email protected]> wrote:
>
> So a lot of Windows APIs require UTF-16. What's up with UTF-8 being the only
> choice mentioned for external communication?
>
> Unicode string encodings like UTF-* and strings of "characters" (that is,
> sequences of Unicode code points) should be clearly distinguished. Do you
> really mean "UTF-32", or do you mean "UCS-4"? Even those two are not exactly
> the same.
>
> On 12/18/15 5:47 , H. Hirzel wrote:
>> Hello Sven
>>
>> Thank you for your report about your experimental, proof of
>> concept, prototype project, that aims to improve Unicode support.
>> Please include me in the loop.
>>
>> Below is my attempt at summarizing the Unicode discussion of the last
>> weeks.
>> Corrections / comments / additions are welcome.
>>
>> Kind regards
>>
>> Hannes
>>
>>
>> 1) There is a need for improved Unicode support implemented _within_
>> the image, probably as a library.
>>
>> 1a) This follows the example of the Twitter CLDR library (i.e.
>> re-implementation of ICU components for Ruby).
>> https://github.com/twitter/twitter-cldr-rb
>>
>> Other languages/libraries have similar approaches
>> - dotNet,
>> https://msdn.microsoft.com/en-us/library/System.Globalization.CharUnicodeInfo%28v=vs.110%29.aspx
>> - Python https://docs.python.org/3/howto/unicode.html
>> - Go http://blog.golang.org/strings
>> - Swift,
>> https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
>> - Perl http://blog.golang.org/strings
>>
>> 1b) ICU is _not_ the way to go (http://site.icu-project.org/). This
>> is because of security and portability reasons (Eliot Miranda) and
>> because of the Smalltalk approach that wants to expose basic
>> algorithms in Smalltalk code. In addition the 16bit based ICU library
>> does not fit well with the Squeak/Pharo UTF32 model.
>>
>> 2) The Unicode infrastructure (21(32) bit wide Characters as immediate
>> objects, use of UTF-32 internally, indexable strings, UTF8 for outside
>> communication, support of code converters) is a very valuable
>> foundation which makes algorithms more straightforward at the expense
>> of more memory usage. It is not used to its full potential at all
>> currently, though a lot of hard work has been done.
>>
>> 3) The Unicode algorithms are mostly table / database driven. This
>> means that dictionary lookup is a prominent part of the algorithms.
>> The essential building block for this is that the Unicode character
>> database UCD (http://www.unicode.org/ucd/) is made available
>> _within_ the image with the full content as needed by the target
>> languages / scripts one wants to deal with. The process of loading the
>> UCD should be made configurable.
>>
>> 3a) A lot of people are interested in the Latin script (and scripts of
>> similar complexity) only.
>> 3b) The UCD data in XML form
>> http://www.unicode.org/Public/8.0.0/ucdxml/ offers a download with
>> and without the CJK characters.
>>
>> 4) The next step is to implement normalization
>> (http://www.unicode.org/reports/tr15/#Norm_Forms). Glad to read that
>> you have reached results here with the test data:
>> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt.
>>
>> 5) Pharo offers nice inspectors to view dictionaries and ordered
>> collections (table view, drill down) which facilitates the development
>> of table driven algorithms. The data structures and algorithms do
>> not depend on a particular dialect though and may be ported to Squeak
>> or Cuis.
>>
>> 6) After having implemented normalization, comparison may be
>> implemented. This needs CLDR access (collation, Unicode Common Locale
>> Data Repository, http://cldr.unicode.org/ ).
>>
>>
>> 7) An architecture has the following subsystems
>>
>> 7a) Basic character handling (21(32)bit characters in indexable
>> strings, point 2)
>> 7b) Runtime access to the Unicode Character Database (point 3)
>> 7c) Converters
>> 7d) Normalization (point 4)
>> 7e) CLDR access (point 6)
>>
>>
>> 8) The implementation should be driven by the current needs.
>>
>> An attainable next goal is to release
>>
>> 8a) a StringBuilder utility class for easier construction of test strings
>> i.e. instead of
>>
>>> normalizer composeString: (#(68 117 776 115 115 101 108 100 111 114 102 32
>>> 75 111 776 110 105 103 115 97 108 108 101 101) collect: #asCharacter as:
>>> String).
>>
>> do
>> normalizer composeString:
>> (StringBuilder construct: 'Du\u0308sseldorf Ko\u0308nigsallee')
>>
>> and construct some test cases with it which illustrate some basic
>> Unicode issues.
>>
>> 8b) identity testing for major languages (e.g. French, German,
>> Spanish) and scripts of similar complexity.
>>
>> 8c) to provide some more documentation of past and concurrent efforts.
>>
>> Note: This summary has only covered string manipulation, not rendering
>> on the screen which is a different issue.
>>
>>
>> On 12/16/15, Sven Van Caekenberghe <[email protected]> wrote:
>>> Hi Hannes,
>>>
>>> My detailed comments/answers below, after quoting 2 of your emails:
>>>
>>>> On 10 Dec 2015, at 22:17, H. Hirzel <[email protected]> wrote:
>>>>
>>>> Hello Sven
>>>>
>>>> On 12/9/15, Sven Van Caekenberghe <[email protected]> wrote:
>>>>
>>>>> The simplest example in a common language (the French letter é) is
>>>>>
>>>>> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>>>>>
>>>>> which can also be written as
>>>>>
>>>>> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT
>>>>> [U+0301]
>>>>>
>>>>> The former being a composed normal form, the latter a decomposed normal
>>>>> form. (And yes, it is even much more complicated than that, it goes on
>>>>> for
>>>>> 1000s of pages).
>>>>>
>>>>> In the above example, the concept of character/string is indeed fuzzy.
>>>>>
>>>>> HTH,
>>>>>
>>>>> Sven
>>>>
>>>> Thanks for this example. I have created a wiki page with it
>>>>
>>>> I wonder what the Pharo equivalent is of the following Squeak expression
>>>>
>>>> $é asString asDecomposedUnicode
>>>>
>>>> Regards
>>>>
>>>> Hannes
>>>
>>> You also wrote:
>>>
>>>> The text below shows how to deal with the Unicode e acute example
>>>> brought up by Sven in terms of comparing strings. Currently Pharo and
>>>> Cuis do not do Normalization of strings. Limited support is in Squeak.
>>>> It will be shown how NFD normalization may be implemented.
>>>>
>>>>
>>>> Swift programming language
>>>> -----------------------------------------
>>>>
>>>> How does the Swift programming language [1] deal with Unicode strings?
>>>>
>>>> // "Voulez-vous un café?" using LATIN SMALL LETTER E WITH ACUTE
>>>> let eAcuteQuestion = "Voulez-vous un caf\u{E9}?"
>>>>
>>>> // "Voulez-vous un café?" using LATIN SMALL LETTER E and
>>>> // COMBINING ACUTE ACCENT
>>>> let combinedEAcuteQuestion = "Voulez-vous un caf\u{65}\u{301}?"
>>>>
>>>> if eAcuteQuestion == combinedEAcuteQuestion {
>>>>     print("These two strings are considered equal")
>>>> }
>>>> // prints "These two strings are considered equal"
>>>>
>>>> The equality operator uses the NFD (Normalization Form Decomposed)[2]
>>>> form for the comparison, applying a method
>>>> #decomposedStringWithCanonicalMapping[3]
>>>>
>>>>
>>>> Squeak / Pharo
>>>> -----------------------
>>>>
>>>> Comparison without NFD [3]
>>>>
>>>>
>>>> "Voulez-vous un café?"
>>>> eAcuteQuestion := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>>>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter
>>>> asString, '?'.
>>>>
>>>>
>>>> eAcuteQuestion = combinedEAcuteQuestion
>>>> false
>>>>
>>>> eAcuteQuestion == combinedEAcuteQuestion
>>>> false
>>>>
>>>> The result is false. A Unicode conformant application however should
>>>> return *true*.
>>>>
>>>> The reason for this is that Squeak / Pharo strings are not put into NFD
>>>> before testing for equality with =
>>>>
>>>>
>>>> Squeak Unicode strings may be tested for Unicode conformant equality
>>>> by converting them to NFD before testing.
>>>>
>>>>
>>>>
>>>> Squeak using NFD
>>>>
>>>> asDecomposedUnicode[4] transforms a string into NFD for cases where a
>>>> Unicode code point, if decomposed, is decomposed into only two code
>>>> points [5]. This is so because when initializing [6] the Unicode
>>>> Character Database in Squeak this is a limitation imposed by the code
>>>> which reads UnicodeData.txt [7][8]. This is not a necessary
>>>> limitation. The code may be rewritten at the price of a more complex
>>>> implementation of #asDecomposedUnicode.
>>>>
>>>> "Voulez-vous un café?"
>>>> eAcuteQuestion := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>>>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter
>>>> asString, '?'.
>>>>
>>>>
>>>> eAcuteQuestion asDecomposedUnicode =
>>>> combinedEAcuteQuestion asDecomposedUnicode
>>>>
>>>> true
>>>>
>>>>
>>>>
>>>> Conclusion
>>>> ------------------
>>>>
>>>> Implementing a method like #decomposedStringWithCanonicalMapping
>>>> (Swift) which puts a string into NFD (Normalization Form D) is an
>>>> important building block towards better Unicode compliance. A Squeak
>>>> proposal is given by [4]. It needs to be reviewed and extended.
>>>>
>>>> It should probably be extended for cases where there are more than
>>>> two code points in the decomposed form (3 or more?)
>>>>
>>>> Implementing NFD comparison gives us an equality test for a
>>>> comparatively small effort for simple cases covering a large number of
>>>> use cases (languages using the Latin script).
>>>>
>>>> The algorithm is table driven by the UCD [8]. From this follows a
>>>> simple but important fact: conformant implementations need runtime
>>>> access to information from the Unicode Character Database [UCD][9].
>>>>
>>>>
>>>> [1]
>>>> https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html#//apple_ref/doc/uid/TP40014097-CH7-ID285
>>>> [2] http://www.unicode.org/glossary/#normalization_form_d
>>>> [3]
>>>> https://developer.apple.com/library/ios/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/index.html#//apple_ref/occ/instm/NSString/decomposedStringWithCanonicalMapping
>>>> [4] String asDecomposedUnicode http://wiki.squeak.org/squeak/6250
>>>> [5] http://www.unicode.org/glossary/#code_point
>>>> [6] Unicode initialize http://wiki.squeak.org/squeak/6248
>>>> [7] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>>>> [8] Unicode Character Database documentation
>>>> http://unicode.org/reports/tr44/
>>>> [9] http://www.unicode.org/reports/tr23/
>>>
>>>
>>> Today, we have a Unicode and CombinedCharacter class in Pharo, and there is
>>> different but similar Unicode code in Squeak. These are too simple (even
>>> though they might work, partially).
>>>
>>> The scope of the original threads is way too wide: a new string type,
>>> normalisation, collation, being cross dialect, mixing all kinds of character
>>> and encoding definitions. All interesting, but not much will come out of it.
>>> But the point that we cannot leave proper text string handling to an outside
>>> library is indeed key.
>>>
>>> That is why a couple of people in the Pharo community (myself included)
>>> started an experimental, proof of concept, prototype project, that aims to
>>> improve Unicode support. We will announce it to a wider public when we feel
>>> we have something to show for it. The goal is in the first place to understand
>>> and implement the fundamental algorithms, starting with the 4 forms of
>>> Normalisation. But we're working on collation/sorting too.
>>>
>>> This work is of course being done for/in Pharo, using some of the facilities
>>> only available there.
>>> It probably won't be difficult to port, but we can't
>>> be bothered with probability right now.
>>>
>>> What we started with is loading UCD data and making it available as nice
>>> objects (30.000 of them).
>>>
>>> So now you can do things like
>>>
>>> $é unicodeCharacterData.
>>>
>>> => "U+00E9 LATIN SMALL LETTER E WITH ACUTE (LATIN SMALL LETTER E ACUTE)"
>>>
>>> $é unicodeCharacterData uppercase asCharacter.
>>>
>>> => "$É"
>>>
>>> $é unicodeCharacterData decompositionMapping.
>>>
>>> => "#(101 769)"
>>>
>>> There is also a cool GT Inspector view:
>>>
>>>
>>>
>>> Next we started implementing a normaliser. It was rather easy to get support
>>> for simpler languages going. The next code snippets use explicit code
>>> arrays, because copying decomposed diacritics to my mail client does not
>>> work (they get automatically composed); in a Pharo Workspace this does work
>>> nicely with plain strings. The higher numbers are the diacritics.
>>>
>>> (normalizer decomposeString: 'les élèves Français') collect: #codePoint as:
>>> Array.
>>>
>>> => "#(108 101 115 32 101 769 108 101 768 118 101 115 32 70 114 97 110 99
>>> 807 97 105 115)"
>>>
>>> (normalizer decomposeString: 'Düsseldorf Königsallee') collect: #codePoint
>>> as: Array.
>>>
>>> => "#(68 117 776 115 115 101 108 100 111 114 102 32 75 111 776 110 105 103
>>> 115 97 108 108 101 101)"
>>>
>>> normalizer composeString: (#(108 101 115 32 101 769 108 101 768 118 101 115
>>> 32 70 114 97 110 99 807 97 105 115) collect: #asCharacter as: String).
>>>
>>> => "'les élèves Français'"
>>>
>>> normalizer composeString: (#(68 117 776 115 115 101 108 100 111 114 102 32
>>> 75 111 776 110 105 103 115 97 108 108 101 101) collect: #asCharacter as:
>>> String).
>>>
>>> => "'Düsseldorf Königsallee'"
>>>
>>> However, the real algorithm following the official specification (and other
>>> elements of Unicode that interact with it) is way more complicated (think
>>> about all those special languages/scripts out there).
>>> We're focused on
>>> understanding/implementing that now.
>>>
>>> Next, unit tests were added (of course). As well as a test that uses
>>> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt to run about
>>> 75.000 individual test cases to check conformance to the official Unicode
>>> Normalization specification.
>>>
>>> Right now (with super cool hangul / jamo code by Henrik), we hit the
>>> following stats:
>>>
>>> #testNFC 16998/18593 (91.42%)
>>> #testNFD 16797/18593 (90.34%)
>>> #testNFKC 13321/18593 (71.65%)
>>> #testNFKD 16564/18593 (89.09%)
>>>
>>> Way better than the naive implementations, but not yet there.
>>>
>>> We are also experimenting and thinking a lot about how to best implement all
>>> this, trying out different models/ideas/apis/representations.
>>>
>>> It will move slowly, but you will hear from us again in the coming
>>> weeks/months.
>>>
>>> Sven
>>>
>>> PS: Pharo developers with a good understanding of this subject area that
>>> want to help, let me know and we'll put you in the loop. Hacking and
>>> specification reading are required ;-)
>>>
>>>
>>
>> .
>>
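
For readers following the thread from outside Pharo, here is a minimal sketch of the two points discussed above: canonical equivalence of the é example, and the difference between a sequence of code points and its UTF-8 / UTF-16 / UTF-32 encodings. It is written in Python (one of the languages in Hannes' list) and uses only the standard `unicodedata` module, not the Pharo prototype. The helper names `code_points`, `canonically_equal` and `encoded_sizes` are made up for this illustration.

```python
import unicodedata


def code_points(text):
    """Return the Unicode code points of text as a list of integers."""
    return [ord(ch) for ch in text]


def canonically_equal(a, b):
    """Compare two strings after putting both into NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def encoded_sizes(text):
    """Byte length of text in three encodings (explicit LE, so no BOM is added)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


# The two spellings of "Voulez-vous un café?" from the thread.
composed = "Voulez-vous un caf\u00e9?"    # LATIN SMALL LETTER E WITH ACUTE
combined = "Voulez-vous un cafe\u0301?"   # e followed by COMBINING ACUTE ACCENT

if __name__ == "__main__":
    # Plain comparison looks at code points only.
    print(composed == combined)                    # False
    # Comparison after NFD treats both spellings as the same text.
    print(canonically_equal(composed, combined))   # True

    # Same decomposition as "$é unicodeCharacterData decompositionMapping".
    print(code_points(unicodedata.normalize("NFD", "\u00e9")))   # [101, 769]

    # One code point versus two, in three encodings.
    print(encoded_sizes("\u00e9"))    # {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
    print(encoded_sizes("e\u0301"))   # {'utf-8': 3, 'utf-16-le': 4, 'utf-32-le': 8}
```

The byte counts show why the choice of external encoding (UTF-8 or UTF-16) is a separate question from normalization: the same text has a different code point count depending on its normal form, and each code point sequence has a different byte length in each encoding.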
