On Thu, Jun 5, 2014 at 8:37 PM, Peter Geoghegan <p...@heroku.com> wrote: > * A configure AC_TRY_RUN tests the suitability of the system strxfrm() > implementation for the purposes of this optimization. There can be no > "header bytes" (people at pgCon reported redundant bytes on the Mac > OSX implementation at pgCon, to my great surprise), ...
I'm attaching a small test program demonstrating the Mac behavior. Here's some sample output: [rhaas ~]$ ./strxfrm en_US "" a aa ab abc abcd abcde peter Geoghegan _ % "" -> "" "a" -> "001S0000001S" "aa" -> "001S001S0000001S001S" "ab" -> "001S001T0000001S001T" "abc" -> "001S001T001U0000001S001T001U" "abcd" -> "001S001T001U001V0000001S001T001U001V" "abcde" -> "001S001T001U001V001W0000001S001T001U001V001W" "peter" -> "001b001W001f001W001d0000001b001W001f001W001d" "Geoghegan" -> "0019001W001a001Y001Z001W001Y001S001`00000019001W001a001Y001Z001W001Y001S001`" "_" -> "001Q0000001Q" "%" -> "000W0000000W" It appears that any string starting with the letter "a" will create output that begins with 001S00 and the seventh character always appears to be 0 or 1: [rhaas ~]$ ./strxfrm en_US ab ac ad ae af a% a0 "a " "ab" -> "001S001T0000001S001T" "ac" -> "001S001U0000001S001U" "ad" -> "001S001V0000001S001V" "ae" -> "001S001W0000001S001W" "af" -> "001S001X0000001S001X" "a%" -> "001S000W0000001S000W" "a0" -> "001S000b0000001S000b" "a " -> "001S000R0000001S000R" Also, the total number of bytes produced is apparently 8N+4, where N is the length of the input string. On a Linux system I tested, the output included non-printable characters, so I adapted the test program to print the results in hex. Attaching that version, too. Here, the length was 3N+2, except for 0-length strings which produce a 0-length result: [rhaas@hydra ~]$ ./strxfrm-binary en_US "" a aa ab abc abcd abcde peter Geoghegan _ % "" -> (0 bytes) "a" -> 0c01020102 (5 bytes) "aa" -> 0c0c010202010202 (8 bytes) "ab" -> 0c0d010202010202 (8 bytes) "abc" -> 0c0d0e0102020201020202 (11 bytes) "abcd" -> 0c0d0e0f01020202020102020202 (14 bytes) "abcde" -> 0c0d0e0f10010202020202010202020202 (17 bytes) "peter" -> 1b101f101d010202020202010202020202 (17 bytes) "Geoghegan" -> 12101a121310120c190102020202020202020201040202020202020202 (29 bytes) "_" -> 0101010112 (5 bytes) "%" -> 0101010139 (5 bytes) [rhaas@hydra ~]$ ./strxfrm-binary en_US ab ac ad ae af a% a0 "a " "ab" -> 0c0d010202010202 (8 bytes) "ac" -> 0c0e010202010202 (8 bytes) "ad" -> 0c0f010202010202 (8 bytes) "ae" -> 0c10010202010202 (8 bytes) "af" -> 0c11010202010202 (8 bytes) "a%" -> 0c01020102010239 (8 bytes) "a0" -> 0c02010202010202 (8 bytes) "a " -> 0c01020102010211 (8 bytes) Even though each input bytes is generating 3 output bytes, it's not generating a group of output bytes for each input byte as appears to be happening on MacOS X. If it were doing that, then truncating the blob to 8 bytes would only compare the first 2-3 bytes of the input string. In fact we do better. If we change even the 8th letter in the string to some other letter, the 8th output byte changes: [rhaas@hydra ~]$ ./strxfrm-binary en_US aaaaaaaa aaaaaaab "aaaaaaaa" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020202 (26 bytes) "aaaaaaab" -> 0c0c0c0c0c0c0c0d010202020202020202010202020202020202 (26 bytes) If we change the capitalization of the eighth byte, then the change happens further out: [rhaas@hydra ~]$ ./strxfrm-binary en_US aaaaaaaa aaaaaaaA "aaaaaaaa" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020202 (26 bytes) "aaaaaaaA" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020204 (26 bytes) Still, it's fair to say that on this Linux system, the first 8 bytes capture a significant portion of the entropy of the first 8 bytes of the string, whereas on MacOS X you only get entropy from the first 2 bytes of the string. It would be interesting to see results from other platforms people might care about also. > The cost of adding all of these ameliorating measures appears to be > very low. We're so bottlenecked on memory bandwidth that the > fixed-size overhead of maintaining poor man cardinality, and the extra > cycles from hashing n keys for the benefit of HyperLogLog just don't > seem to matter next to the big savings for n log n comparisons. The > best case performance is seemingly just as good as before (about a > 200% improvement in transaction throughput for one client, but closer > to a 250% improvement with many clients and/or expensive collations), > while the worst case is much much better. I haven't looked at the patch yet, but this sounds promising. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
#include <stdio.h> #include <stdlib.h> #include <errno.h> #include <locale.h> #include <string.h> int main(int argc, char **argv) { char *buffer = NULL; size_t buflen = 0; if (argc < 3) { fprintf(stderr, "Usage: strxfrm {collation} {string}...\n"); exit(1); } if (setlocale(LC_COLLATE, argv[1]) == NULL) { fprintf(stderr, "setlocale: %s\n", strerror(errno)); exit(1); } argv += 2; while (*argv) { size_t r; r = strxfrm(buffer, *argv, buflen); if (r > buflen) { if (buffer != NULL) free(buffer); buflen = r + 1; buffer = malloc(buflen); if (buffer == NULL) { fprintf(stderr, "out of memory\n"); exit(1); } continue; } printf("\"%s\" -> \"%s\"\n", *argv, buffer); ++argv; } exit(0); }
#include <stdio.h> #include <stdlib.h> #include <errno.h> #include <locale.h> #include <string.h> int main(int argc, char **argv) { char *buffer = NULL; size_t buflen = 0; char *extrabuf = NULL; size_t extralen = 0; const char *hexdigit = "0123456789abcdef"; if (argc < 3) { fprintf(stderr, "Usage: strxfrm {collation} {string}...\n"); exit(1); } if (setlocale(LC_COLLATE, argv[1]) == NULL) { fprintf(stderr, "setlocale: %s\n", strerror(errno)); exit(1); } argv += 2; while (*argv) { size_t r; size_t k; r = strxfrm(buffer, *argv, buflen); if (r > buflen) { if (buffer != NULL) free(buffer); buflen = r + 1; buffer = malloc(buflen); if (buffer == NULL) { fprintf(stderr, "out of memory\n"); exit(1); } continue; } if (extralen < 2 * r + 1) { if (extrabuf != NULL) free(extrabuf); extralen = 2 * r + 1; extrabuf = malloc(extralen); if (extrabuf == NULL) { fprintf(stderr, "out of memory\n"); exit(1); } } for (k = 0; k < r; ++k) { unsigned char c = buffer[k]; extrabuf[2 * k] = hexdigit[c >> 4]; extrabuf[2 * k + 1] = hexdigit[c & 15]; } extrabuf[2 * k] = '\0'; printf("\"%s\" -> %s (%d bytes)\n", *argv, extrabuf, r); ++argv; } exit(0); }
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers