RE: [IndexedDB] Closing on bug 9903 (collations)

Pablo Castro Fri, 17 Jun 2011 11:45:39 -0700

From: [email protected] [mailto:[email protected]] On 
Behalf Of Keean Schupke
Sent: Tuesday, May 31, 2011 11:51 PM


>> On 1 June 2011 01:37, Pablo Castro <[email protected]> wrote:
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Aryeh 
>> Gregor
>> Sent: Tuesday, May 31, 2011 3:49 PM
>>
>> >> On Tue, May 31, 2011 at 6:39 PM, Pablo Castro
>> >> <[email protected]> wrote:
>> >> > No, that was poor wording on my part, I keep using "locale" in the 
>> >> > wrong context. I meant to have the API take a proper collation 
>> >> > identifier. The identifier can be as specific as the caller wants it to 
>> >> > be. The implementation could choose to not honor some specific detail 
>> >> > if it can't handle it (to the extent that doing so is allowed by the 
>> >> > specification of collation names), or fail because it considers that 
>> >> > not handling a particular aspect of the collation identifier would 
>> >> > severely deviate from the caller's expectations.
>> >>
>> >> I'm not sure I understand you.  My personal opinion is that there
>> >> should be no undefined behavior here.  If authors are allowed to pass
>> >> collation identifiers, the spec needs to say exactly how they're to be
>> >> interpreted, so the same identifier passed to two different browsers
>> >> will result in the same collation, i.e., the same strings need to sort
>> >> the same cross-browser.  Having only binary collation is better than
>> >> having non-binary collations but not defining them, IMO.
>> I thought BCP47 allowed implementations to drop subtags if needed. I just 
>> re-read the spec and it seems that it only allows to do that in constrained 
>> cases where you can't fit the whole name in your buffer (which wouldn't 
>> apply to the context discussed here). My first instinct is that this is 
>> quite a bit to guarantee (full consistency in collation), but it seems that 
>> that's what the spec is shooting for.
>>
>> >> > Given the amount of debate on this, could we at least agree that we can 
>> >> > do binary for v1? We can then have an open item for v2 on taking 
>> >> > collation names and sort according to UCA or taking callbacks and such.
>> >>
>> >> I'm okay with supporting only binary to start with.
>> Great. I'll still wait a bit to see what other folks think, and then update 
>> the bug one way or the other.
>>
>> Thanks
>> -pablo
>>
>> The discussion sounds like it is headed in the right direction. Are there 
>> any issues with non-unicode encodings that need to be dealt with (HTTP 
>> headers default to ISO-8859 I think). Would people be expected to convert on 
>> read into UTF-16 strings or use typed-arrays?

I asked around here, and folks pointed out that the JavaScript spec seems to 
describe exactly what we need. In [1], section 11.8.5, the relevant fragment, 
starting at step 4, reads:

4. Else, both px and py are Strings
    a. If py is a prefix of px, return false. (A String value p is a prefix of 
String value q if q can be the result of concatenating p and some other String 
r. Note that any String is a prefix of itself, because r may be the empty 
String.)
    b. If px is a prefix of py, return true.
    c. Let k be the smallest nonnegative integer such that the character at 
position k within px is different from the character at position k within py. 
(There must be such a k, for neither String is a prefix of the other.)
    d. Let m be the integer that is the code unit value for the character at 
position k within px.
    e. Let n be the integer that is the code unit value for the character at 
position k within py.
    f. If m < n, return true. Otherwise, return false.
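
As a rough illustration only (this is not spec text), the string branch quoted 
above could be sketched in JavaScript along these lines. The names 
lessThanByCodeUnit and compareKeys are hypothetical, made up for this example, 
and the sketch assumes both arguments are already Strings:

```javascript
// Hypothetical helper: true when px sorts strictly before py under the
// string branch of section 11.8.5 (steps a-f as quoted above).
function lessThanByCodeUnit(px, py) {
  // a. If py is a prefix of px (this includes px === py), px is not less.
  if (px.slice(0, py.length) === py) return false;
  // b. If px is a prefix of py, px is less.
  if (py.slice(0, px.length) === px) return true;
  // c. Find the smallest k at which the code units differ. The loop ends
  //    because neither string is a prefix of the other.
  let k = 0;
  while (px.charCodeAt(k) === py.charCodeAt(k)) k++;
  // d, e. Code unit values at position k.
  const m = px.charCodeAt(k);
  const n = py.charCodeAt(k);
  // f. Compare the two code unit values.
  return m < n;
}

// Hypothetical three-way comparator built on top of the helper.
function compareKeys(x, y) {
  if (lessThanByCodeUnit(x, y)) return -1;
  if (lessThanByCodeUnit(y, x)) return 1;
  return 0;
}

const sorted = ["b", "ab", "a"].sort(compareKeys); // ["a", "ab", "b"]
```

For Strings this should agree with the built-in < operator, which is the point: 
keys would sort the same way a script author already sees strings sort.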

It also has a note below indicating:

NOTE 2 The comparison of Strings uses a simple lexicographic ordering on 
sequences of code unit values. There is no attempt to use the more complex, 
semantically oriented definitions of character or string equality and collating 
order defined in the Unicode specification. Therefore String values that are 
canonically equal according to the Unicode standard could test as unequal. In 
effect this algorithm assumes that both Strings are already in normalised form. 
Also, note that for strings containing supplementary characters, lexicographic 
ordering on sequences of UTF-16 code unit values differs from that on sequences 
of code point values.
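
To make the two caveats in that note concrete, here is a small sketch. The 
sample characters are my own choice, and codePointAt and normalize are methods 
from later editions of the language than the one referenced in [1]; they are 
used here purely for demonstration:

```javascript
// U+FF5E (FULLWIDTH TILDE) is a single UTF-16 code unit, 0xFF5E.
// U+1F600 is a supplementary character, encoded as the surrogate
// pair 0xD83D 0xDE00.
const bmp = "\uFF5E";
const supplementary = "\uD83D\uDE00";

// Code unit order: 0xD83D < 0xFF5E, so the supplementary character
// sorts first.
const byCodeUnit = supplementary < bmp; // true

// Code point order: 0x1F600 > 0xFF5E, so the result is the opposite.
const byCodePoint = supplementary.codePointAt(0) < bmp.codePointAt(0); // false

// Canonically equivalent strings compare as unequal unless they are
// normalised first.
const composed = "\u00E9";     // precomposed e with acute accent
const decomposed = "e\u0301";  // e followed by a combining acute accent
const equalRaw = composed === decomposed; // false
const equalNfc =
  composed.normalize("NFC") === decomposed.normalize("NFC"); // true
```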

This is very much in line with what we've been discussing, and it has the added 
benefit of matching JavaScript's own string ordering.

So it looks like we could reference (or inline) this in the spec and have a 
fully specified order for keys with string content.

Thoughts? 

Thanks
-pablo

[1] http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf
