Well, just to give the big picture: ICU collation is to binary collation what CouchDB is to an RDBMS.

If you look at the link you gave as the algorithm reference, section 1.1
(multi-level comparison) has this example in its table:

  L2  Accents   role < rôle < roles

To be complete it should also show:

  L2  Accents   role < rôle < roles < rôles

It seems stupid but it's not, because this is about language, and o and ô
are the same letter, as they explain just below the table:

  "For example, the L2 example shows that difference between an o and an
  accented ô is swamped by an L1 difference (the presence or absence of
  an s)"

If you go on to section 1.3, you have the French ordering:

  Normal Accent Ordering   cote < coté < côte < côté
  French Accent Ordering   cote < côte < coté < côté

To be fair, I'm French, so maybe that's why it doesn't seem odd to me.

Now let's follow the main algorithm in section 4 with your three "words",
a, A and aa; I'll even add AA.

Step 1: Normalize (easy, there are no special characters in your example)

  a  -> a      LATIN SMALL LETTER A
               (see http://www.unicode.org/Public/UCA/latest/allkeys.txt)
  A  -> A      LATIN CAPITAL LETTER A
  aa -> a, a   LATIN SMALL LETTER A, LATIN SMALL LETTER A
  AA -> A, A   LATIN CAPITAL LETTER A, LATIN CAPITAL LETTER A

Step 2: Produce Array

  a  -> [.1141.0020.0002.0061]
  A  -> [.1141.0020.0008.0041]
  aa -> [.1141.0020.0002.0061], [.1141.0020.0002.0061]
  AA -> [.1141.0020.0008.0041], [.1141.0020.0008.0041]

Step 3: Form Sort Key

The algorithm is something like this, where s is the string to sort and
sArray is the array produced for s in step 2:

  s="a"  = ["a"]       sArray=[[1141,0020,0002,0061]]
  s="A"  = ["A"]       sArray=[[1141,0020,0008,0041]]
  s="aa" = ["a","a"]   sArray=[[1141,0020,0002,0061],[1141,0020,0002,0061]]
  s="AA" = ["A","A"]   sArray=[[1141,0020,0008,0041],[1141,0020,0008,0041]]

(let's say we use level 3)

  sortKey = [];
  for (i = 0; i < 3; i++) {
      for (j = 0; j < sArray.length; j++) {
          sortKey.push(sArray[j][i]);
      }
  }

(I leave out the 0000 level separator that the real algorithm puts between
levels; it does not change the result for these four strings.)

So we have:

  a  -> [1141, 0020, 0002]
  A  -> [1141, 0020, 0008]
  aa -> [1141, 1141, 0020, 0020, 0002, 0002]
  AA -> [1141, 1141, 0020, 0020, 0008, 0008]

Step 4: Compare

  First element:  1141 == 1141 == 1141 == 1141, no order yet.
  Second element: 0020 == 0020 < 1141 == 1141, so the order so far is
                  a==A < aa==AA
  Third element (only for a==A):    0002 < 0008, so a < A < aa==AA
  Third element (only for aa==AA):  0020 == 0020, still aa==AA
  Fourth element (only for aa==AA): 0020 == 0020, still aa==AA
  Fifth element (only for aa==AA):  0002 < 0008, so aa < AA

Finally: a < A < aa < AA

So Paul, is it OK for you now?

2009/4/10 Paul Davis <[email protected]>

> Perhaps I'm not explaining my expectation well enough.
>
> The way I read the algorithm, the basic idea is that you take the
> weights for each character and concatenate them. then you run throw
> these representations and do a basic element wise comparison.
>
> I know that a and A have the same primary and secondary weights. But
> they have different tertiary weights.
>
> Thus as I read the algorithm (most likely missing the important clause
> affecting this) then when I compare a and A, A > a. I don't see why
> any other character is considered. A is fucking bigger than a. If
> there is a section that says, "Oh, btw, if one string is an exact
> prefix of the other as defined solely by primary weights, then the
> prefix sorts first" i would be happy as a pig in shit.
>
> On Thu, Apr 9, 2009 at 7:18 PM, Patrick Antivackis
> <[email protected]> wrote:
> > Paul,
> >
> > 2009/4/10 Paul Davis <[email protected]>
> >
> >> I've tried various combinations of UC_CASE_LEVEL, UC_CASE_FIRST, and
> >> UC_WEIGHT.
> >>
> >
> > This is really not enough. Doing this you only try to say to the
> > collation that a<<<A or A<<<a (third element)
> > but still it's an a upper, lower, witha tilde, with an accent or
> > wathever.
> > All are just variation of A but still A.
> >
> > If you look at :
> > http://www.unicode.org/Public/UCA/latest/allkeys.txt
> >
> > and search for :
> >
> > 0061 ; [.1141.0020.0002.0061] # LATIN SMALL LETTER A
> >
> > you will see a lot of A definition, but all have the same first element
> > :1141, they are all the same letter, other are just variation. So
> > compared to each of them they have an order but compared with an other
> > letter they all behave the same like an A
> >
> > So, now if you want to change order of primary element, you need to use
> > custom tailoring :
> > http://userguide.icu-project.org/collation/customization
> >
> > And you need to say thing like :
> > a < A (primary order)
> > So to simulate ASCII behaviour you should try something like : a < b < c
> > <d < ......<A <B <C ....., so almost retype the ASCII table.
> >
> > To be honest, i not tried, but that should work
> >
>
> Let me reiterate. I do *not* want A > b. I want A fucking greater than
> a. Or not greater. But this relation:
>
> a < A < aa
>
> to me means:
>
> a < A < a
>
> Am I the only one that finds that just a bit ridiculous?
>
> HTH,
> Paul Davis
>
> p.s. the cursing isn't directed at anyone here. I'm just fairly
> frustrated by that unicode algorithm.
>
> >
> >> Also, I still don't see anything in this damned collation algorithm
> >> that explains how A < aa. And this doesn't fall into the big/biggest
> >> comparison. The similar case would be Big < biggest. But I don't see
> >> anything in the damn collation algorithm document that talks about
> >> ignoring anything after primary weight in the case that one string is
> >> a prefix. In the various examples that I see I can't find anything
> >> that would contradict that expectation.
> >>
> >> Paul Davis
> >>
> >> For reference, the algorithm reference I'm using is this one:
> >> http://unicode.org/reports/tr10/
> >>
> >> I feel like printing the entire thing just so I can have a book burning.
> >>
> >> On Thu, Apr 9, 2009 at 6:39 PM, Patrick Antivackis
> >> <[email protected]> wrote:
> >> > By the way, what customization did you try to send to ICU ?
> >> >
> >> > 2009/4/10 Paul Davis <[email protected]>
> >> >
> >> >> Patrick,
> >> >>
> >> >> I'm not asking for this relationship:
> >> >>
> >> >> a < b < A < B
> >> >>
> >> >> Merely:
> >> >>
> >> >> a < aa < A
> >> >>
> >> >> The thing is that even when I try and specify explicitly that 'A;
> >> >> should come after 'a' I still can't get the expected "a < aa < A"
> >> >> behavior. In a nutshell, "Why the hell does the second 'a' alter the
> >> >> comparison?"
> >> >>
> >> >> HTH,
> >> >> Paul Davis
> >> >>
> >> >> On Thu, Apr 9, 2009 at 5:45 PM, Patrick Antivackis
> >> >> <[email protected]> wrote:
> >> >> > It's quite normal as far as ICU is concerned.
> >> >> > ICU is about language not about ASCII code.
> >> >> > In ICU, case is the third element looked for comparison (same level
> >> >> > than circled letter in Nordic languages for example), so not very
> >> >> > important.
> >> >> > So when you sort words together, a or A is still an a, so they are
> >> >> > sorted nearby. In ICU you can specify if you prefer a before A or A
> >> >> > before a, but not simply a before b before c.... before A before B
> >> >> > before C.
> >> >> >
> >> >> > To have such behavior (like ASCII) you need to custom ICU in
> >> >> > specifying the collation you want almost letter by letter.
> >> >> > It is great for you, but what about Japanese users or Arabic users ??
> >> >> >
> >> >> > So this is definitely the right behaviour of ICU sorting (collation).
> >> >> >
> >> >> >
> >> >> > 2009/4/9 Brian Candler <[email protected]>
> >> >> >
> >> >> >> > I've spent entirely too long on this now and I still can't for the
> >> >> >> > life of me figure out why A < aa.
> >> >> >>
> >> >> >> Time for an experimental, black-box approach:
> >> >> >>
> >> >> >> ----
> >> >> >> require 'rubygems'
> >> >> >> require 'restclient'
> >> >> >> require 'json'
> >> >> >>
> >> >> >> DB="http://127.0.0.1:5984/collator"
> >> >> >>
> >> >> >> RestClient.delete DB rescue nil
> >> >> >> RestClient.put "#{DB}",""
> >> >> >>
> >> >> >> (32..126).each do |c|
> >> >> >>   RestClient.put "#{DB}/#{c.to_s(16)}", {"x"=>c.chr}.to_json
> >> >> >> end
> >> >> >>
> >> >> >> RestClient.put "#{DB}/_design/test", <<EOS
> >> >> >> {
> >> >> >>   "views":{
> >> >> >>     "one":{
> >> >> >>       "map":"function (doc) { emit(doc.x,null); }"
> >> >> >>     }
> >> >> >>   }
> >> >> >> }
> >> >> >> EOS
> >> >> >>
> >> >> >> puts RestClient.get("#{DB}/_design/test/_view/one")
> >> >> >> ----
> >> >> >>
> >> >> >> This shows the collation sequence to be as follows.
> >> >> >>
> >> >> >> {"total_rows":95,"offset":0,"rows":[
> >> >> >> {"id":"20","key":" ","value":null},
> >> >> >> {"id":"60","key":"`","value":null},
> >> >> >> {"id":"5e","key":"^","value":null},
> >> >> >> {"id":"5f","key":"_","value":null},
> >> >> >> {"id":"2d","key":"-","value":null},
> >> >> >> {"id":"2c","key":",","value":null},
> >> >> >> {"id":"3b","key":";","value":null},
> >> >> >> {"id":"3a","key":":","value":null},
> >> >> >> {"id":"21","key":"!","value":null},
> >> >> >> {"id":"3f","key":"?","value":null},
> >> >> >> {"id":"2e","key":".","value":null},
> >> >> >> {"id":"27","key":"'","value":null},
> >> >> >> {"id":"22","key":"\"","value":null},
> >> >> >> {"id":"28","key":"(","value":null},
> >> >> >> {"id":"29","key":")","value":null},
> >> >> >> {"id":"5b","key":"[","value":null},
> >> >> >> {"id":"5d","key":"]","value":null},
> >> >> >> {"id":"7b","key":"{","value":null},
> >> >> >> {"id":"7d","key":"}","value":null},
> >> >> >> {"id":"40","key":"@","value":null},
> >> >> >> {"id":"2a","key":"*","value":null},
> >> >> >> {"id":"2f","key":"/","value":null},
> >> >> >> {"id":"5c","key":"\\","value":null},
> >> >> >> {"id":"26","key":"&","value":null},
> >> >> >> {"id":"23","key":"#","value":null},
> >> >> >> {"id":"25","key":"%","value":null},
> >> >> >> {"id":"2b","key":"+","value":null},
> >> >> >> {"id":"3c","key":"<","value":null},
> >> >> >> {"id":"3d","key":"=","value":null},
> >> >> >> {"id":"3e","key":">","value":null},
> >> >> >> {"id":"7c","key":"|","value":null},
> >> >> >> {"id":"7e","key":"~","value":null},
> >> >> >> {"id":"24","key":"$","value":null},
> >> >> >> {"id":"30","key":"0","value":null},
> >> >> >> {"id":"31","key":"1","value":null},
> >> >> >> {"id":"32","key":"2","value":null},
> >> >> >> {"id":"33","key":"3","value":null},
> >> >> >> {"id":"34","key":"4","value":null},
> >> >> >> {"id":"35","key":"5","value":null},
> >> >> >> {"id":"36","key":"6","value":null},
> >> >> >> {"id":"37","key":"7","value":null},
> >> >> >> {"id":"38","key":"8","value":null},
> >> >> >> {"id":"39","key":"9","value":null},
> >> >> >> {"id":"61","key":"a","value":null},
> >> >> >> {"id":"41","key":"A","value":null},
> >> >> >> {"id":"62","key":"b","value":null},
> >> >> >> {"id":"42","key":"B","value":null},
> >> >> >> {"id":"63","key":"c","value":null},
> >> >> >> {"id":"43","key":"C","value":null},
> >> >> >> {"id":"64","key":"d","value":null},
> >> >> >> {"id":"44","key":"D","value":null},
> >> >> >> {"id":"65","key":"e","value":null},
> >> >> >> {"id":"45","key":"E","value":null},
> >> >> >> {"id":"66","key":"f","value":null},
> >> >> >> {"id":"46","key":"F","value":null},
> >> >> >> {"id":"67","key":"g","value":null},
> >> >> >> {"id":"47","key":"G","value":null},
> >> >> >> {"id":"68","key":"h","value":null},
> >> >> >> {"id":"48","key":"H","value":null},
> >> >> >> {"id":"69","key":"i","value":null},
> >> >> >> {"id":"49","key":"I","value":null},
> >> >> >> {"id":"6a","key":"j","value":null},
> >> >> >> {"id":"4a","key":"J","value":null},
> >> >> >> {"id":"6b","key":"k","value":null},
> >> >> >> {"id":"4b","key":"K","value":null},
> >> >> >> {"id":"6c","key":"l","value":null},
> >> >> >> {"id":"4c","key":"L","value":null},
> >> >> >> {"id":"6d","key":"m","value":null},
> >> >> >> {"id":"4d","key":"M","value":null},
> >> >> >> {"id":"6e","key":"n","value":null},
> >> >> >> {"id":"4e","key":"N","value":null},
> >> >> >> {"id":"6f","key":"o","value":null},
> >> >> >> {"id":"4f","key":"O","value":null},
> >> >> >> {"id":"70","key":"p","value":null},
> >> >> >> {"id":"50","key":"P","value":null},
> >> >> >> {"id":"71","key":"q","value":null},
> >> >> >> {"id":"51","key":"Q","value":null},
> >> >> >> {"id":"72","key":"r","value":null},
> >> >> >> {"id":"52","key":"R","value":null},
> >> >> >> {"id":"73","key":"s","value":null},
> >> >> >> {"id":"53","key":"S","value":null},
> >> >> >> {"id":"74","key":"t","value":null},
> >> >> >> {"id":"54","key":"T","value":null},
> >> >> >> {"id":"75","key":"u","value":null},
> >> >> >> {"id":"55","key":"U","value":null},
> >> >> >> {"id":"76","key":"v","value":null},
> >> >> >> {"id":"56","key":"V","value":null},
> >> >> >> {"id":"77","key":"w","value":null},
> >> >> >> {"id":"57","key":"W","value":null},
> >> >> >> {"id":"78","key":"x","value":null},
> >> >> >> {"id":"58","key":"X","value":null},
> >> >> >> {"id":"79","key":"y","value":null},
> >> >> >> {"id":"59","key":"Y","value":null},
> >> >> >> {"id":"7a","key":"z","value":null},
> >> >> >> {"id":"5a","key":"Z","value":null}
> >> >> >> ]}
> >> >> >>
> >> >> >> I've never seen this sequence before. It's not even EBCDIC :-)
> >> >> >>
> >> >> >> Adding aa into the pot gives:
> >> >> >>
> >> >> >> ...
> >> >> >> {"id":"61","key":"a","value":null},
> >> >> >> {"id":"41","key":"A","value":null},
> >> >> >> {"id":"X","key":"aa","value":null},
> >> >> >> ...
> >> >> >>
> >> >> >> As you say, that is most bizarre.
> >> >> >>
> >> >> >> Cheers,
> >> >> >>
> >> >> >> Brian.
> >> >> >>
> >> >> >
> >> >>
> >> >
> >>
> >
>
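
The four steps walked through at the top of this message (normalize, produce the array, form the sort key, compare) can be sketched in JavaScript as below. This is only a minimal illustration under stated assumptions, not ICU's actual implementation: the WEIGHTS table holds just the two allkeys.txt entries quoted in this thread, and the names collationElements, sortKey, compareKeys and collate are made up for the example. Unlike the simplified loop in the walkthrough, the sketch puts a 0000 level separator between levels, as the algorithm in http://unicode.org/reports/tr10/ describes; for these four strings the resulting order is the same.

```javascript
// Hypothetical mini weight table: only the two entries quoted in this thread,
// as [primary, secondary, tertiary]. A real collator loads all of allkeys.txt.
const WEIGHTS = {
  a: [0x1141, 0x0020, 0x0002], // LATIN SMALL LETTER A
  A: [0x1141, 0x0020, 0x0008], // LATIN CAPITAL LETTER A
};

// Step 2: map every character of the string to its collation element.
// (Step 1, normalization, is skipped: plain ASCII letters need none.)
function collationElements(s) {
  return Array.from(s, (ch) => {
    if (!(ch in WEIGHTS)) throw new Error("no weights known for " + ch);
    return WEIGHTS[ch];
  });
}

// Step 3: all primary weights first, then all secondary, then all tertiary,
// with a 0000 separator between levels.
function sortKey(s, levels = 3) {
  const elements = collationElements(s);
  const key = [];
  for (let level = 0; level < levels; level++) {
    if (level > 0) key.push(0x0000);
    for (const ce of elements) key.push(ce[level]);
  }
  return key;
}

// Step 4: plain element-wise comparison of two sort keys.
function compareKeys(k1, k2) {
  const n = Math.min(k1.length, k2.length);
  for (let i = 0; i < n; i++) {
    if (k1[i] !== k2[i]) return k1[i] - k2[i];
  }
  return k1.length - k2.length;
}

// Sort a list of words by their sort keys (returns a new array).
function collate(words) {
  return [...words].sort((x, y) => compareKeys(sortKey(x), sortKey(y)));
}

console.log(collate(["AA", "aa", "A", "a"]).join(" < "));
```

The point the sketch makes is the one argued above: the second primary weight of aa is compared before any tertiary (case) weight is looked at, so A sorts before aa even though A is "bigger" than a at level 3.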
