Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

Palle Girgensohn Thu, 11 Aug 2016 04:23:01 -0700

> On 11 Aug 2016, at 11:15, Palle Girgensohn <gir...@pingpong.net> wrote:
> 
>> 
>> On 11 Aug 2016, at 03:05, Peter Geoghegan <p...@heroku.com> wrote:
>> 
>> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <gir...@pingpong.net> 
>> wrote:
>>> They've been used for the FreeBSD ports since 2005, and have served us 
>>> well. I have of course updated them regularly. In this latest version, I've 
>>> removed support for other encodings beside UTF-8, mostly since I don't know 
>>> how to test them, but also, I see little point in supporting anything else 
>>> using ICU.
>> 
>> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
>> is not the release that introduced its use, it did expand it
>> significantly. I think you need to fix this, even though it isn't
>> actually used to sort text at present, since presumably FreeBSD builds
>> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
>> reasonably trust the ICU equivalent of strxfrm(), so that's a missed
>> opportunity. (See the wiki page on the abbreviated keys issue [1] if
>> you don't know what I'm talking about.)
> 
> My plan was to get it working without TRUST_STRXFRM first, and then add that 
> functionality. I've made some preliminary tests using ICU's ucol_getSortKey, 
> but I will have to test it a bit more. For now, I just expect not to trust 
> strxfrm. This is the first iteration wrt strxfrm; the plan is to use that 
> code base.


Here are some preliminary results from comparing the same two strings 10000 
times in a tight loop. Each line shows the comparison result followed by the 
elapsed time.

              ucol_strcollUTF8: -1      0.002448
                       strcoll: 1       0.060711
              ucol_strcollIter: -1      0.009221
                 direct memcmp: 1       0.000457
                 memcpy memcmp: 1       0.001706
                memcpy strcoll: 1       0.068425
               nextSortKeyPart: -1      0.041011
    ucnv_toUChars + getSortKey: -1      0.050379


The correct answer is -1, but since we compare åasdf and äasdf with a Swedish 
locale, memcmp and strcoll fail, of course, as expected. Direct memcmp is about 
5 times faster than ucol_strcollUTF8 (used in my patch), but sadly the best 
implementation using sort keys with ICU, nextSortKeyPart, is way slower: 
roughly 17 times slower than ucol_strcollUTF8.
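
To make the memcmp result concrete, here is a minimal sketch of why a bytewise comparison of these two strings returns 1. It assumes both strings are UTF-8 encoded; the names s_aring, s_auml and bytewise_sign are made up for this illustration and are not part of the patch or of the benchmark program.

```c
#include <assert.h>
#include <string.h>

/* "åasdf" and "äasdf" as UTF-8. The literals are split in two so that the
 * hex escape does not swallow the following 'a' as another hex digit. */
static const char s_aring[] = "\xc3\xa5" "asdf";   /* å = U+00E5 -> C3 A5 */
static const char s_auml[]  = "\xc3\xa4" "asdf";   /* ä = U+00E4 -> C3 A4 */

/* Plain bytewise comparison, normalized to -1, 0 or 1.
 * Illustrative helper only; it knows nothing about locales. */
static int bytewise_sign(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    size_t la = strlen(a);
    size_t lb = strlen(b);
    size_t n = la < lb ? la : lb;

    int r = memcmp(a, b, n);          /* compares bytes as unsigned char */
    if (r == 0)
        r = (la > lb) - (la < lb);    /* shorter string sorts first on a tie */
    return (r > 0) - (r < 0);
}
```

Calling bytewise_sign(s_aring, s_auml) gives 1, because the second byte of å (0xA5) is larger than the second byte of ä (0xA4), while Swedish collation puts å before ä and so expects -1.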



        startTime = getRealTime();
        for ( int i = 0; i < loop; i++) {
                result = ucol_strcollUTF8(collator, arg1, len1, arg2, len2, &status);
        }
        endTime = getRealTime();
        printf("%30s: %d\t%lf\n", "ucol_strcollUTF8", result, endTime - startTime);



vs


        int sortkeysize=8;

        startTime = getRealTime();
        uint8_t key1[sortkeysize], key2[sortkeysize];
        uint32_t sState[2], tState[2];
        UCharIterator sIter, tIter;

        for ( int i = 0; i < loop; i++) {
                uiter_setUTF8(&sIter, arg1, len1);
                uiter_setUTF8(&tIter, arg2, len2);
                sState[0] = 0; sState[1] = 0;
                tState[0] = 0; tState[1] = 0;
                ucol_nextSortKeyPart(collator, &sIter, sState, key1, sortkeysize, &status);
                ucol_nextSortKeyPart(collator, &tIter, tState, key2, sortkeysize, &status);
                result = memcmp (key1, key2, sortkeysize);
        }
        endTime = getRealTime();
        printf("%30s: %d\t%lf\n", "nextSortKeyPart", result, endTime - startTime);



But in your strxfrm code in PostgreSQL, the keys are cached and represented as 
int64s, if I remember correctly, so perhaps there is still a benefit to using 
the abbreviated keys? More testing is required, I guess...
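
To illustrate what caching could buy, here is a rough sketch of the general idea. The first 8 bytes of a sort key are packed once into an integer, later comparisons are a single integer compare, and only ties fall back to the full comparison. This is not the PostgreSQL code. The names abbrev_pack, abbrev_compare, full_cmp_fn and full_cmp_strcmp are hypothetical, and the zero padding assumes that a zero byte only ever marks the end of a key.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ABBREV_LEN 8   /* same size as sortkeysize in the test above */

/* Pack up to 8 key bytes big-endian, so that comparing the integers gives
 * the same order as memcmp on the bytes. Short keys are zero-padded. */
static uint64_t abbrev_pack(const uint8_t *key, size_t len)
{
    assert(key != NULL || len == 0);

    uint64_t v = 0;
    for (size_t i = 0; i < ABBREV_LEN; i++) {
        v <<= 8;
        if (i < len)
            v |= key[i];
    }
    return v;
}

/* Hypothetical full comparator, e.g. a wrapper around ucol_strcollUTF8. */
typedef int (*full_cmp_fn)(const void *a, const void *b);

/* Stand-in full comparator for this sketch: plain strcmp, normalized. */
static int full_cmp_strcmp(const void *a, const void *b)
{
    int r = strcmp((const char *) a, (const char *) b);
    return (r > 0) - (r < 0);
}

/* Compare the cached abbreviated keys first. Equal prefixes decide
 * nothing, so a tie is resolved by the full comparison. */
static int abbrev_compare(uint64_t ka, uint64_t kb,
                          const void *a, const void *b, full_cmp_fn full)
{
    if (ka < kb)
        return -1;
    if (ka > kb)
        return 1;
    return full(a, b);
}
```

The packing cost is paid once per string rather than once per comparison, which is the part the tight-loop test above does not capture, since it regenerates both keys on every iteration.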

Palle
