> 11 aug. 2016 kl. 11:15 skrev Palle Girgensohn <gir...@pingpong.net>: > >> >> 11 aug. 2016 kl. 03:05 skrev Peter Geoghegan <p...@heroku.com>: >> >> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <gir...@pingpong.net> >> wrote: >>> They've been used for the FreeBSD ports since 2005, and have served us >>> well. I have of course updated them regularly. In this latest version, I've >>> removed support for other encodings beside UTF-8, mostly since I don't know >>> how to test them, but also, I see little point in supporting anything else >>> using ICU. >> >> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5 >> is not the release that introduced its use, it did expand it >> significantly. I think you need to fix this, even though it isn't >> actually used to sort text at present, since presumably FreeBSD builds >> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could >> reasonably trust the ICU equivalent of strxfrm(), so that's a missed >> opportunity. (See the wiki page on the abbreviated keys issue [1] if >> you don't know what I'm talking about.) > > My plan was to get it working without TRUST_STRXFRM first, and then add that > functinality. I've made some preliminary tests using ICU:s ucol_getSortKey > but I will have to test it a bit more. For now, I just expect not to trust > strxfrm. It is the first iteration wrt strxfrm, the plan is to use that code > base.
Here are some preliminary results running 10000 times comparing the same two strings in a tight loop. ucol_strcollUTF8: -1 0.002448 strcoll: 1 0.060711 ucol_strcollIter: -1 0.009221 direct memcmp: 1 0.000457 memcpy memcmp: 1 0.001706 memcpy strcoll: 1 0.068425 nextSortKeyPart: -1 0.041011 ucnv_toUChars + getSortKey: -1 0.050379 correct answer is -1, but since we compare åasdf and äasdf with a Swedish locale, memcmp and strcoll fails of course, as espected. Direct memcmp is 5 times faster than ucol_strcollUTF8 (used in my patch), but sadly the best implementation using sort keys with ICU, nextSortKeyPart, is way slower. startTime = getRealTime(); for ( int i = 0; i < loop; i++) { result = ucol_strcollUTF8(collator, arg1, len1, arg2, len2, &status); } endTime = getRealTime(); printf("%30s: %d\t%lf\n", "ucol_strcollUTF8", result, endTime - startTime); vs int sortkeysize=8; startTime = getRealTime(); uint8_t key1[sortkeysize], key2[sortkeysize]; uint32_t sState[2], tState[2]; UCharIterator sIter, tIter; for ( int i = 0; i < loop; i++) { uiter_setUTF8(&sIter, arg1, len1); uiter_setUTF8(&tIter, arg2, len2); sState[0] = 0; sState[1] = 0; tState[0] = 0; tState[1] = 0; ucol_nextSortKeyPart(collator, &sIter, sState, key1, sortkeysize, &status); ucol_nextSortKeyPart(collator, &tIter, tState, key2, sortkeysize, &status); result = memcmp (key1, key2, sortkeysize); } endTime = getRealTime(); printf("%30s: %d\t%lf\n", "nextSortKeyPart", result, endTime - startTime); But in your strxfrm code in PostgreSQL, the keys are cached, and represented as int64:s if I remember correctly, so perhaps there is still a benefit using the abbreviated keys? More testing is required, I guess... Palle
signature.asc
Description: Message signed with OpenPGP using GPGMail