Dale - is it that you can't depend on the value of a code point *unless
the string is either in fully-composed form (or has just been
fully-decomposed from a fully-composed form)*, OR are there circumstances
where even those two cases cannot be relied upon?

On 8 December 2015 at 19:20, Dale Henrichs <[email protected]> wrote:
>
>
> On 12/07/2015 11:31 PM, H. Hirzel wrote:
>>
>> Dale
>>
>> Thank you for your answer with links to the ICU library and the notes
>> about classes in Gemstone. Noteworthy that you have a class Utf8 as a
>> subclass of ByteArray.
>>
>> I understand that Gemstone uses the ICU library and thus does not
>> implement the algorithms in Smalltalk.
>>
>> I am currently looking into what the ICU library provides.
>>
>> I found as well a Ruby library [2] which implements CLDR [3]
>>
>> It has methods like this
>>
>> "Alphabetize a list using regular Ruby sort:"
>>
>> $> ["Art", "Wasa", "Älg", "Ved"].sort
>> $> ["Art", "Ved", "Wasa", "Älg"]
>>
>> Alphabetize a list using TwitterCLDR’s locale-aware sort:
>>
>> $> ["Art", "Wasa", "Älg", "Ved"].localize(:de).sort.to_a
>> $> ["Älg", "Art", "Ved", "Wasa"]
>>
>> I hope that given such an example it would not be too difficult to
>> reimplement a similar sort algorithm in Squeak/Cuis/Pharo. Currently
>> the interest is in getting sorting done in a cross-dialect-way.
>>
>
> I think that the issue (from a performance perspective) is that you can't
> depend upon the value of the code point when doing collation --- the main
> algorithm[5] is pretty much table based --- In addition to the different
> sort orders based on characters there are even more arcane sort rules where
> characters at the end of a word can affect the sort order of the word (for
> more info see[4]).
>
> It is worth looking at the Conformance section of the Unicode spec[1] as
> there are different levels of collation conformance .....
>
> ICU conforms[2] to UTS #10[3], the highest level of conformance ...
>
> It looks like TwitterCLDR[6] uses the Main Algorithm[5] with tailoring[7].
> They don't claim to be conformant to the Unicode Collation Algorithm[3], but
> they are covering a big chunk of the standard use cases ....
>
> Dale
>
> [1] http://unicode.org/reports/tr10/#Conformance
> [2] http://userguide.icu-project.org/collation
> [3] http://www.unicode.org/reports/tr10/
> [4] http://www.unicode.org/reports/tr10/#Introduction
> [5] http://www.unicode.org/reports/tr10/#Main_Algorithm
> [6]
> https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
> [7] http://unicode.org/reports/tr10/#Tailoring
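
To make the code point question concrete, here is a rough Ruby sketch (Ruby only because the TwitterCLDR example quoted above is Ruby). It is not the Unicode Collation Algorithm and not what ICU or TwitterCLDR actually do. PRIMARY_WEIGHTS and toy_sort_key are names I made up, and the weights are invented for these four words; real tables come from UCA/CLDR data and carry several weight levels. It only shows that sorting by code point gives a different order from a table lookup, and that the code points themselves change with the normalization form.

```ruby
# Hypothetical, drastically simplified primary-weight table.
# NOT real UCA/CLDR data: the weights are invented for this example only.
PRIMARY_WEIGHTS = {
  "a" => 1, "\u00E4" => 1, # a-umlaut gets the same primary weight as "a"
  "d" => 2, "e" => 3, "g" => 4, "l" => 5,
  "r" => 6, "s" => 7, "t" => 8, "v" => 9, "w" => 10
}.freeze

# Build a sort key by table lookup instead of using code point values.
# The string is normalized to NFC first so that composed and decomposed
# input map to the same table entries.
def toy_sort_key(word)
  word.unicode_normalize(:nfc).downcase.each_char.map do |ch|
    PRIMARY_WEIGHTS.fetch(ch)
  end
end

composed   = "\u00C4lg"                        # A-umlaut as one code point
decomposed = composed.unicode_normalize(:nfd)  # "A" followed by U+0308

words = ["Art", "Wasa", composed, "Ved"]

by_code_point = words.sort                          # plain String#sort
by_table      = words.sort_by { |w| toy_sort_key(w) }

p by_code_point
p by_table
p composed.codepoints
p decomposed.codepoints
```

With the composed form the plain sort puts "Älg" last, as in Hannes' example, while the table lookup puts it first. The decomposed form has different code points again, so a plain sort would place it somewhere else, yet it yields the same toy sort key.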
