Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

Tom Lane Sat, 18 Feb 2012 16:30:16 -0800

I wrote:
> And here's a poorly-tested draft patch for that.

I've done some more testing now, and am satisfied that this works as
intended.  However, some crude performance testing suggests that people
might be annoyed with it.  As an example, in 9.1 with pl_PL.utf8 locale,
I see this:
        select 'aaaaaaaaaa' ~ '\w\w\w\w\w\w\w\w\w\w\w';
taking perhaps 0.75 ms on first execution and 0.4 ms on subsequent
executions, the difference being the time needed to compile and cache
the DFA representation of the regexp.  With the patch, the numbers are
more like 5 ms and 0.4 ms, meaning the compilation time has gone up by
something near a factor of 10, though AFAICT execution time hasn't
moved.  It's hard to tell how significant that would be to real-world
queries, but in the worst case where our caching of regexps doesn't help
much, it could be disastrous.


All of the extra time is in manipulation of the much larger number of
DFA arcs required to represent all the additional character codes that
are being considered to be letters.

Perhaps I'm being overly ASCII-centric, but I'm afraid to commit this
as-is; I think the number of people who are hurt by the performance
degradation will be far larger than the number who are glad because
characters in $random_alphabet are now seen to be letters.  I think an
actually workable solution will require something like what I speculated
about earlier:

> Yeah, it's conceivable that we could implement something whereby
> characters with codes above some cutoff point are handled via runtime
> calls to iswalpha() and friends, rather than being included in the
> statically-constructed DFA maps.  The cutoff point could likely be a lot
> less than U+FFFF, too, thereby saving storage and map build time all
> round.
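
To make the cutoff idea concrete, here is a minimal C sketch. It is not the
draft patch and not the actual regex code; the names ALPHA_CUTOFF,
build_alpha_map and is_alpha_cp are hypothetical, and it assumes a platform
where wint_t/wchar_t values are Unicode code points (true on glibc, not
everywhere). Code points at or below the cutoff get an entry in a prebuilt
map; anything above falls through to a runtime iswalpha() call.

```c
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical cutoff: code points at or below this get a prebuilt entry. */
#define ALPHA_CUTOFF 0x7FF

static bool alpha_map[ALPHA_CUTOFF + 1];

/* Fill the map once for the current locale; cost is bounded by the cutoff. */
static void build_alpha_map(void)
{
    for (wint_t c = 0; c <= ALPHA_CUTOFF; c++)
        alpha_map[c] = (iswalpha(c) != 0);
}

/* Classify one code point: map lookup below the cutoff, libc call above. */
static bool is_alpha_cp(wint_t c)
{
    if (c <= ALPHA_CUTOFF)
        return alpha_map[c];
    return iswalpha(c) != 0;
}
```

The caller is assumed to have set the locale already and to call
build_alpha_map() before the first lookup. Lowering the cutoff shrinks the
map and its build time, at the price of more runtime calls for text that
uses higher code points.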

In the meantime, I still think the caching logic is worth having, and
we could at least make some people happy if we selected a cutoff point
somewhere between U+FF and U+FFFF.  I don't have any strong ideas about
what a good compromise cutoff would be.  One possibility is U+7FF, which
corresponds to the limit of what fits in 2-byte UTF-8; but I don't know
if that corresponds to any significant dropoff in frequency of usage.
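
For reference, the U+7FF boundary follows directly from the UTF-8 length
rules. The helper below is only an illustration (utf8_encoded_len is a
hypothetical name, not something in the tree), and it does not reject
surrogate code points.

```c
#include <stdint.h>

/* Bytes UTF-8 needs for a code point; 0 if beyond the Unicode range. */
static int utf8_encoded_len(uint32_t cp)
{
    if (cp <= 0x7F)
        return 1;               /* ASCII */
    if (cp <= 0x7FF)
        return 2;               /* last code point that fits in 2 bytes */
    if (cp <= 0xFFFF)
        return 3;               /* rest of the Basic Multilingual Plane */
    if (cp <= 0x10FFFF)
        return 4;               /* supplementary planes */
    return 0;
}
```

So utf8_encoded_len(0x7FF) is 2 and utf8_encoded_len(0x800) is 3. The
2-byte range takes in the Latin, Greek, Cyrillic, Hebrew and Arabic blocks,
among others; whether that matches a real dropoff in usage is a separate
question.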

                        regards, tom lane

