Mike Matrigali <[EMAIL PROTECTED]> writes:

> Thanks, I have not written the like tests yet, and am looking for
> examples like the following where the result under the default
> system is different under collation vs default that can be added
> to the junit tests, but have to admit I don't know much about
> languages other than english.

I know very little about how collation is defined in the standards, but I would guess the trickiest part is character sequences that map to a single collation element, like ch in Spanish or aa in the Scandinavian languages. Since I happen to know Norwegian fairly well, I'll try to present what I would expect, and then perhaps someone else could chime in and explain how/if those expectations map onto the standards (Unicode, SQL, etc.). Hopefully, this could also give you some ideas on how to write some meaningful tests.

In Norwegian, the character sequence "aa" is to be treated as the single letter "å" if it is pronounced identically to "å". Since "a" is the first letter of the alphabet and "å" the last, this has consequences for how words are ordered alphabetically. However, not all occurrences of "aa" are pronounced as "å". In fact, today it is used this way more or less exclusively in family names. You won't find any words in a dictionary where a double a is to be pronounced as "å", only in lists of names.

So if you have a word like "ekstraarbeid" (an actual word found in the dictionary), it should be listed before "ekstrabetaling" (another actual word), even though aa = å > b, because the double a is pronounced as two separate a's. Similarly, in the phone book, you will find "Haase" before "Hatlen" (aa in Haase is a long a, hence counted as two letters), but you'll find "Wanvik" before "Waagan" (aa in Waagan is pronounced and alphabetized as å).

This has some funny consequences: the very first name in the phone book for Trondheim, Norway is "Aalaei", whereas the last name you find in it is "Aavitsland".

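To make the rule above concrete, here is a minimal Java sketch. It is not how Derby or any collation standard handles this; it only illustrates the expectation. The class name, the hand-rolled alphabet string and the AA_IS_A_RING lookup (a hypothetical list of names in which "aa" is pronounced as "å") are all assumptions made for the example. The letters æ, ø and å are written as Unicode escapes so that the source stays plain ASCII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class NorwegianAaSketch {
    // Norwegian alphabet order: a..z, then ae-ligature, o-slash, a-ring.
    private static final String ALPHABET =
            "abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5";

    // Hypothetical lookup of words where "aa" is pronounced as a-ring.
    // A real system would need some external source for this knowledge.
    private static final Set<String> AA_IS_A_RING =
            new HashSet<>(Arrays.asList("waagan", "aavitsland"));

    // Simplified, case-insensitive sort key: "aa" becomes a-ring only
    // for the words listed above; all other words keep two separate a's.
    static String sortKey(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        return AA_IS_A_RING.contains(lower)
                ? lower.replace("aa", "\u00e5")
                : lower;
    }

    // Compare two words letter by letter using the alphabet order above.
    static int compare(String left, String right) {
        String a = sortKey(left);
        String b = sortKey(right);
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int diff = ALPHABET.indexOf(a.charAt(i))
                    - ALPHABET.indexOf(b.charAt(i));
            if (diff != 0) {
                return diff;
            }
        }
        return a.length() - b.length();
    }

    // Return a sorted copy; the input list is left untouched.
    static List<String> sort(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(NorwegianAaSketch::compare);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sort(Arrays.asList(
                "ekstrabetaling", "ekstraarbeid", "Hatlen", "Haase",
                "Wanvik", "Waagan", "Aalaei", "Aavitsland")));
    }
}
```

With the eight example words, this prints Aalaei, ekstraarbeid, ekstrabetaling, Haase, Hatlen, Wanvik, Waagan, Aavitsland. The catch, of course, is that the pronunciation cannot be derived from the spelling, so the lookup has to come from somewhere.
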
So, my expectation is that there is some way to have a list of words sorted like this:

  Aalaei
  ekstraarbeid
  ekstrabetaling
  Haase
  Hatlen
  Wanvik
  Waagan
  Aavitsland

The way these words are currently sorted with territory based collation and Norwegian territory is:

  ekstrabetaling
  ekstraarbeid
  Hatlen
  Haase
  Wanvik
  Waagan
  Aalaei
  Aavitsland

I skimmed through the Unicode Collation Algorithm at http://unicode.org/reports/tr10/ to find out how this is supposed to be handled. A paragraph under 3.1.1 Multiple Mappings said:

  Any character (such as soft hyphen) that is not completely ignorable between two characters of a contraction will cause them to sort as separate characters. Thus a soft hyphen can be used to separate and cause distinct weighting of sequences such as Slovak ch or Danish aa that would normally weight as units.

This sounds like what I need, and placing a soft hyphen between the a's that I wanted to be interpreted as two single letters did indeed give me the sorting order I wanted. However, even though the sorting seems to ignore the soft hyphens (actually, it seems to ignore all kinds of punctuation characters), string matching does not ignore them, so 'H_ase' does not match 'Ha<soft-hyphen>ase' with the LIKE predicate.

Is this supposed to be possible, that is, to let LIKE regard 'aa' (or 'a<some-special-char>a') as two separate yet consecutive letters?

-- 
Knut Anders
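
PS: The LIKE mismatch can be illustrated with a small Java sketch. It does not use Derby's LIKE implementation; it assumes a plain character-by-character LIKE (no ESCAPE clause, no collation) translated to a regular expression, and the class and method names are made up for the example. The last method shows one conceivable workaround, stripping soft hyphens from the value before matching, which is only an idea and not something Derby is known to do.

```java
import java.util.regex.Pattern;

class LikeSoftHyphenSketch {
    // U+00AD SOFT HYPHEN, written as an escape to keep the source ASCII.
    private static final String SOFT_HYPHEN = "\u00AD";

    // Translate a SQL LIKE pattern (without ESCAPE) into a regex:
    // '_' matches exactly one character, '%' matches any run of characters.
    static Pattern likeToRegex(String like) {
        StringBuilder regex = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '_') {
                regex.append('.');
            } else if (c == '%') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Plain LIKE: every character in the value counts, soft hyphen included.
    static boolean like(String value, String pattern) {
        return likeToRegex(pattern).matcher(value).matches();
    }

    // Hypothetical variant: drop soft hyphens from the value first.
    static boolean likeIgnoringSoftHyphen(String value, String pattern) {
        return like(value.replace(SOFT_HYPHEN, ""), pattern);
    }

    public static void main(String[] args) {
        String stored = "Ha" + SOFT_HYPHEN + "ase";
        System.out.println(like(stored, "H_ase"));
        System.out.println(like(stored, "H__ase"));
        System.out.println(likeIgnoringSoftHyphen(stored, "H_ase"));
    }
}
```

The first call returns false because the soft hyphen occupies a character position of its own, so 'H_ase' is one character too short; 'H__ase' matches, and so does 'H_ase' once the soft hyphen is stripped.
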
