Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

Robert Haas Sat, 18 Feb 2012 19:33:45 -0800

On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Yeah, it's conceivable that we could implement something whereby
>> characters with codes above some cutoff point are handled via runtime
>> calls to iswalpha() and friends, rather than being included in the
>> statically-constructed DFA maps.  The cutoff point could likely be a lot
>> less than U+FFFF, too, thereby saving storage and map build time all
>> round.
>
> In the meantime, I still think the caching logic is worth having, and
> we could at least make some people happy if we selected a cutoff point
> somewhere between U+FF and U+FFFF.  I don't have any strong ideas about
> what a good compromise cutoff would be.  One possibility is U+7FF, which
> corresponds to the limit of what fits in 2-byte UTF8; but I don't know
> if that corresponds to any significant dropoff in frequency of usage.


The problem, of course, is that this probably depends quite a bit on
what language you happen to be using.  For some languages, it won't
matter whether you cut it off at U+FF or U+7FF; while for others even
U+FFFF might not be enough.  So I think this is one of those cases
where it's somewhat meaningless to talk about frequency of usage.

In theory you can imagine a regular expression engine where these
decisions can be postponed until we see the string we're matching
against.  IOW, your DFA ends up with state transitions for characters
specifically named, plus a state transition for "anything else that's
a letter", plus a state transition for "anything else not otherwise
specified".  Then you only need to test the letters that actually
appear in the target string, rather than all of the ones that might
appear there.

But implementing that could be quite a lot of work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

Reply via email to