Re: [HACKERS] Better locale-specific-character-class handling for regexps

Tom Lane Sun, 04 Sep 2016 10:45:06 -0700

Heikki Linnakangas <[email protected]> writes:
> On 08/23/2016 03:54 AM, Tom Lane wrote:
>> ! the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
>> ! whose rows correspond to character ranges that are explicitly mentioned
>> ! in the input, and whose columns correspond to sets of relevant locale
>> ! character classes.


> I think that last sentence should say "whose rows correspond to 
> character ranges that are explicitly mentioned in the *regex pattern*", 
> rather than "in the input".

OK.  The pattern is the input under consideration by regcomp.c, but I see
your point that someone might be confused.

> An example would be very helpful here.

I'll see what I can do.

> I can easily come up with examples for 
> character classes, but not for "high" character-ranges. The best I could 
> come up with is to check if a characters belongs to some special group 
> of unicode characters, like U&'[\+01D100-\+01D1FF]' to check for musical 
> symbol characters.

Right, or someone might write the equivalent of [A-Z] in the Klingon
alphabet, or whatever --- it wouldn't necessarily have to use backslash
notation, it could be literal characters with large codes.

> In practice, I guess you will only see single 
> characters in the colormaprange array, although we must of course cope 
> with ranges too.

I suspect that's true; I'm not sure that there are many places in the
upper reaches of Unicode space where ranges of consecutive character
codes form natural units to search for.

>> +       /* this relies on WHITE being zero: */
>> +       memset(cm->locolormap, WHITE,
>> +                  (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));

> Is the comment correct? I don't see why this wouldn't work with "WHITE 
> != 0".

The array entries are shorts not chars, so if WHITE were say 2, memset
would fill the array with 0x0202 rather than 2.  I suppose if WHITE were
say 0x0303 then it would accidentally work, but I don't think the comment
needs to get into that.

> Perhaps "backwards" would be clearer than "downwards"?

OK.

> +1 for this patch in general. Some regression test cases would be nice.

I'm not sure how to write such tests without introducing insurmountable
platform dependencies --- particularly on platforms with weak support for
UTF8 locales, such as OS X.  All the interesting cases require knowing
what iswalpha() etc will return for some high character codes.

What I did to test it during development was to set MAX_SIMPLE_CHR to
something in the ASCII range, so that the high-character-code paths could
be tested without making any assumptions about locale classifications for
non-ASCII characters.  I'm not sure that's a helpful idea for regression
testing purposes, though.

I guess I could follow the lead of collate.linux.utf8.sql and produce
a test that's only promised to pass on one platform with one encoding,
but I'm not terribly excited by that.  AFAIK that test file does not
get run at all in the buildfarm or in the wild.

Thanks for reviewing!

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Better locale-specific-character-class handling for regexps

Reply via email to