I wrote:
> And here's a poorly-tested draft patch for that.

I've done some more testing now, and am satisfied that this works as
intended.  However, some crude performance testing suggests that people
might be annoyed with it.  As an example, in 9.1 with pl_PL.utf8 locale,
I see this:

	select 'aaaaaaaaaa' ~ '\w\w\w\w\w\w\w\w\w\w\w';

taking perhaps 0.75 ms on first execution and 0.4 ms on subsequent
executions, the difference being the time needed to compile and cache
the DFA representation of the regexp.  With the patch, the numbers are
more like 5 ms and 0.4 ms, meaning the compilation time has gone up by
something near a factor of 10, though AFAICT execution time hasn't
moved.  It's hard to tell how significant that would be to real-world
queries, but in the worst case where our caching of regexps doesn't help
much, it could be disastrous.
All of the extra time is in manipulation of the much larger number of
DFA arcs required to represent all the additional character codes that
are being considered to be letters.

Perhaps I'm being overly ASCII-centric, but I'm afraid to commit this
as-is; I think the number of people who are hurt by the performance
degradation will be greatly larger than the number who are glad because
characters in $random_alphabet are now seen to be letters.  I think an
actually workable solution will require something like what I speculated
about earlier:

> Yeah, it's conceivable that we could implement something whereby
> characters with codes above some cutoff point are handled via runtime
> calls to iswalpha() and friends, rather than being included in the
> statically-constructed DFA maps.  The cutoff point could likely be a lot
> less than U+FFFF, too, thereby saving storage and map build time all
> round.

In the meantime, I still think the caching logic is worth having, and
we could at least make some people happy if we selected a cutoff point
somewhere between U+FF and U+FFFF.  I don't have any strong ideas about
what a good compromise cutoff would be.  One possibility is U+7FF, which
corresponds to the limit of what fits in 2-byte UTF8; but I don't know
if that corresponds to any significant dropoff in frequency of usage.

			regards, tom lane
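
For illustration only, the cutoff idea quoted above might be sketched
roughly as follows.  This is not the draft patch and not the regex
engine's actual code: the names CLASS_MAP_CUTOFF, alpha_map,
build_alpha_map(), char_is_alpha() and utf8_encoded_len() are all
hypothetical, and a plain lookup table stands in for the
statically-constructed DFA maps.  The cutoff of U+7FF is just the
candidate value mentioned above, not a settled choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <wctype.h>

/* Hypothetical cutoff: U+7FF, the candidate value discussed above. */
#define CLASS_MAP_CUTOFF 0x7FF

/* Precomputed "is a letter" flags for code points 0..CLASS_MAP_CUTOFF. */
static bool alpha_map[CLASS_MAP_CUTOFF + 1];
static bool alpha_map_ready = false;

/* Fill the table once, using the current locale's iswalpha(). */
void
build_alpha_map(void)
{
	for (wint_t c = 0; c <= CLASS_MAP_CUTOFF; c++)
		alpha_map[c] = (iswalpha(c) != 0);
	alpha_map_ready = true;
}

/* Table lookup at or below the cutoff; runtime iswalpha() call above it. */
bool
char_is_alpha(wint_t c)
{
	if (!alpha_map_ready)
		build_alpha_map();
	if (c <= CLASS_MAP_CUTOFF)
		return alpha_map[c];
	return iswalpha(c) != 0;
}

/* Number of bytes needed to encode code point c in UTF-8. */
int
utf8_encoded_len(unsigned int c)
{
	assert(c <= 0x10FFFF);
	if (c <= 0x7F)
		return 1;
	if (c <= 0x7FF)
		return 2;
	if (c <= 0xFFFF)
		return 3;
	return 4;
}
```

In this sketch the table build cost is bounded by the cutoff rather than
by the size of the character set, which is the saving the quoted
proposal is after; utf8_encoded_len() only shows that U+7FF is the last
code point with a 2-byte encoding and U+800 the first that needs three.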