Localization requirements for Unicode collation vary widely. In a file system, for example, locale-neutral collation and case folding are required; see the CIFS protocol. In general, you have to be careful whenever the sorted data set may include data from different locales, or when metadata about the data set's locale is missing.
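To make the file-system case concrete, here is a minimal sketch of a locale-neutral comparison key in Python. It relies on the standard library's `str.casefold()` (Unicode full case folding) and `unicodedata.normalize()`; the helper names `fold_key` and `same_name` are made up for illustration, and this is not the algorithm CIFS itself specifies.

```python
import unicodedata


def fold_key(name: str) -> str:
    """Return a comparison key that does not depend on the process locale.

    Assumption: NFC normalization followed by Unicode full case folding is
    good enough for this illustration; a real file system defines its own
    folding table.
    """
    return unicodedata.normalize("NFC", name).casefold()


def same_name(a: str, b: str) -> bool:
    """Compare two names case-insensitively, ignoring the current locale."""
    return fold_key(a) == fold_key(b)


# Sorting by the folded key gives the same order on every machine,
# whatever locale the process happens to run under.
names = ["readme.TXT", "Zebra", "apple", "README.txt"]
ordered = sorted(names, key=fold_key)
```

Because the key never consults the locale, two hosts with different locale settings agree on whether two names collide. The order is by code point of the folded key, not a linguistic collation.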
Richard is right, however, that collation results for Unicode data can vary by locale. But the differences are few and far between. For certain Chinese locales and for Japanese, there are a few minor differences between generic Unicode collation and the locale-specific results. I believe the "German phonebook" sort is different as well, but a German language locale does not always imply the "German phonebook" sort.

Finally, case folding is easily implemented in a locale-neutral way. Case distinctions are defined only in certain Western languages, and there are only a couple thousand upper/lower character pairs, which do not change by locale (not true for title casing, however).

In fact, Unicode case folding is an excellent application for a perfect hash. A perfect hash function is easily generated for the foldable characters. See Bob Jenkins's page on perfect hash functions (includes generator source):

http://burtleburtle.net/bob/hash/perfect.html

Quoting "Morse, Richard E.MGH" <[email protected]>:

> On Sep 13, 2011, at 11:22 PM, Uri Guttman wrote:
>
>> we discussed this and it would be very easy for a user to call this on
>> their keys. and it should be easy enough to add it as an option to the
>> module. would anyone want to work on this for me? it would mean adding
>> simple boolean option handling code for a key, generated code to load
>> the Unicode::Collate module and create an instance of it (very easy) and
>> then applying the getSortKey method on the extracted key value (also
>> easy). the hardest thing and it isn't too hard is writing a test for it.
>
> One note -- in order to do this properly, you'll also need to provide
> support for setting the proper locale for the Unicode::Collate module
> -- either via another parameter, or by requiring a Unicode::Collate
> object to be provided...
>
> Ricky
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm

