On Wed, Oct 19, 2011 at 6:33 PM, Martin Schreiber <mse00...@gmail.com> wrote:
> Does it use locale specific collation in PasUnicodeCompareStr and
> PasUnicodeCompareText?

Good point. No, not yet. But AFAIK this affects only Turkish, Azeri and
Lithuanian. Adding Turkish and Azeri is trivial, because UTF8LowerCase
already supports them, but I have not yet understood the rules for
Lithuanian; they are quite convoluted and depend on nearby chars and
the like.

> Is the performance of UTF8LowerCase and UTF8UpperCase OK?

UTF8LowerCase was heavily optimized. UTF8UpperCase still needs more
optimization. 6 million UTF8LowerCase operations on the string
"АБВЕЁЖЗКЛМНОПРДЙГ" take 2.6 seconds on my computer. It outperforms
iconv by a factor of approx. 2.5x:

UTF8LowerCase-- Performance test took:
 804 ms
1896 ms
2318 ms
3460 ms
2647 ms
1847 ms
2526 ms
2496 ms
1830 ms
1975 ms

CWString SysUtils.UnicodeLowerCase-- Performance test took:
2456 ms
2461 ms
6594 ms
6170 ms
5347 ms
6939 ms
4398 ms
4429 ms
2285 ms
2411 ms

For these strings:

    if j = 0 then Str := UTF8LowerCase('abcdefghijklmnopqrstuwvxyz');
    if j = 1 then Str := UTF8LowerCase('ABCDEFGHIJKLMNOPQRSTUWVXYZ');
    if j = 2 then Str := UTF8LowerCase('aąbcćdeęfghijklłmnńoóprsśtuwyzźż');
    if j = 3 then Str := UTF8LowerCase('AĄBCĆDEĘFGHIJKLŁMNŃOÓPRSŚTUWYZŹŻ');
    if j = 4 then Str := UTF8LowerCase('АБВЕЁЖЗКЛМНОПРДЙГ');
    if j = 5 then Str := UTF8LowerCase('名字叫嘉英,嘉陵江的嘉,英國的英');
    if j = 6 then Str := UTF8LowerCase('AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuWvVwXxYyZz');
    if j = 7 then Str := UTF8LowerCase('AAaaBBbbCCccDDddEEeeFFffGGggHHhhIIiiJJjjKKkkLLllMMmm');
    if j = 8 then Str := UTF8LowerCase('abcDefgHijkLmnoPqrsTuwvXyz');
    if j = 9 then Str := UTF8LowerCase('ABCdEFGhIJKlMNOpQRStUWVxYZ');

> Do UTF8LowerCase and UTF8UpperCase cover all upper/lowercase Unicode
> (possibly accented) characters?

AFAIK UTF8LowerCase currently covers all characters in the latest
Unicode spec. Of course I might have forgotten something, but I have
tests for chars from 0000 to 0580 and more tests for other clusters.
UTF8UpperCase is currently implemented from 0000 to 0450, but I will
add the rest.

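
To make the Turkish/Azeri point from the top of this reply concrete, here is a minimal sketch of language-dependent lowercasing. It is written in Python purely for illustration and is not the UTF8LowerCase code; the function name `lower_with_lang`, the `lang` parameter and the `TURKIC_LANGS` set are all hypothetical names chosen for this example.

```python
# Illustrative sketch only: not the LCL UTF8LowerCase implementation.
# All names below are hypothetical.

TURKIC_LANGS = {"tr", "az"}  # language tags assumed for this sketch


def lower_with_lang(text: str, lang: str = "") -> str:
    """Lowercase text, applying the dotted/dotless I rule for Turkish and Azeri."""
    if lang in TURKIC_LANGS:
        # In these languages U+0049 'I' lowercases to U+0131 (dotless i),
        # and U+0130 (capital I with dot above) lowercases to plain 'i'.
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    # Every other character uses the default, language-independent mapping.
    return text.lower()


print(lower_with_lang("ISTANBUL"))        # istanbul
print(lower_with_lang("ISTANBUL", "tr"))  # ıstanbul
```

The sketch handles only the two precomposed letters. It ignores the decomposed sequence "I" followed by U+0307 and does nothing for Lithuanian, whose rules depend on the neighbouring combining marks.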
> Does it handle decomposed characters (cwstring doesn't)?

I think that decomposed characters should work naturally. For example,
if we have:

[0]=A
[1]=~ (tilde accent, the special combining version, which follows its
      base letter)

which forms "Ã", and we lowercase it, we get [0]=a and [1] unchanged,
which forms "ã". Or am I wrong?

If you are talking about handling this in CompareText, then AFAIK it
would be too inefficient to do it there, so we would need another
routine for that, NormalizedCompareText or something like that, which
performs normalization, then lowercasing and finally the comparison.

--
Felipe Monteiro de Carvalho

_______________________________________________
Lazarus mailing list
Lazarus@lists.lazarus.freepascal.org
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
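
A minimal sketch of the "normalize, then lowercase, then compare" idea suggested above for a separate NormalizedCompareText routine. It is in Python for illustration only and assumes NFC as the normalization form, which the message does not specify; `normalized_compare_text` is a hypothetical name, not an existing LCL or RTL function.

```python
import unicodedata


def normalized_compare_text(a: str, b: str) -> int:
    """Case-insensitive compare after normalization; returns -1, 0 or 1."""
    # Step 1: normalize, so composed and decomposed spellings become identical.
    # Step 2: lowercase. Step 3: compare code point by code point.
    ka = unicodedata.normalize("NFC", a).lower()
    kb = unicodedata.normalize("NFC", b).lower()
    return (ka > kb) - (ka < kb)


composed = "\u00c3"     # "Ã" as a single code point
decomposed = "A\u0303"  # "A" followed by the combining tilde

# Plain lowercasing changes only the base letter of the decomposed form.
print(decomposed.lower() == "a\u0303")                 # True
# Without normalization the two spellings still differ after lowercasing.
print(composed.lower() == decomposed.lower())          # False
# With normalization they compare equal.
print(normalized_compare_text(composed, decomposed))   # 0
```

Keeping this in a separate routine leaves the plain comparison free of the normalization cost, which is the trade-off described in the message.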