Re: Unicode regex and Hindi language

Terry Reedy Sat, 29 Nov 2008 14:44:18 -0800

MRAB wrote:

Terry Reedy wrote:

I notice from the manual "All identifiers are converted into the normal form NFC while parsing; comparison of identifiers is based on NFC." If NFC uses accented letters, then the issue is finessed away for European words simply because Unicode includes combined characters for European scripts but not for South Asian scripts.
Does that mean that the re module will need to convert both the pattern and the text to be searched into NFC form first?

The quote says that Python3 internally converts all identifiers in source code to NFC before compiling the code, so it can properly compare them. If this was purely an internal matter, this would not need to be said. I interpret the quote as a warning that a programmer who wants to compare a 3.0 string to an identifier represented as a string is responsible for making sure that *his* string is also in NFC. For instance:


ident = 3
...
if 'ident' in globals(): ...

The second ident must be NFC even if the programmer prefers and habitually writes another form because, like it or not, the first one will be turned into NFC before insertion into the code object and later into globals().
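
A minimal sketch of that situation, assuming an identifier with an accented letter (the name "café" is only an illustration, not from the manual). The source text passed to exec() spells the identifier in decomposed form (NFD), but the key that ends up in the namespace is the composed form. Note that current CPython documentation describes the normalization as NFKC rather than NFC; for this particular name the two give the same result.

```python
import unicodedata

# Hypothetical identifier, chosen only because it has two Unicode spellings.
nfc_name = "caf\u00e9"                               # 'é' as one codepoint
nfd_name = unicodedata.normalize("NFD", nfc_name)    # 'e' + U+0301 combining acute

# Compile and run source code that spells the identifier in NFD.
namespace = {}
exec(nfd_name + " = 3", namespace)

# The parser normalized the identifier, so only the composed
# spelling is found as a key in the namespace.
print(nfd_name in namespace)
print(nfc_name in namespace)
print(namespace[nfc_name])
```

So a string lookup such as `name in globals()` only succeeds if the string is normalized the same way the parser normalized the identifier.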

So my thought is that re should take the strings as given, but that the re doc should warn about logically equal forms not matching. (Perhaps it does already; I have not read it in years.) If a text uses a different normalization form, which some surely will, the programmer is responsible for using the same in the re pattern.
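
To illustrate what "logically equal forms not matching" looks like in practice, here is a small sketch. The sample text and pattern are made up for the example; the only library calls used are re.search() and unicodedata.normalize().

```python
import re
import unicodedata

# Pattern written in composed form (NFC); text stored in decomposed form (NFD).
pattern = "caf\u00e9"
text = unicodedata.normalize("NFD", "un caf\u00e9 noir")

# re compares codepoints as given, so the two spellings do not match.
raw_match = re.search(pattern, text)
print(raw_match)

# Bringing both strings to the same normalization form first makes them match.
same_form_match = re.search(
    unicodedata.normalize("NFC", pattern),
    unicodedata.normalize("NFC", text),
)
print(same_form_match.group())
```

Here the normalization is done by the caller, which is the division of responsibility suggested above: re itself leaves both strings untouched.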

And I'm still not clear whether \w, when used on a string consisting of Lo followed by Mc, should match Lo and then Mc (one codepoint at a time) or together (one character at a time, where a character consists of some base character codepoint possibly followed by modifier codepoints).

Programs that transform text to glyphs may have to read bundles of codepoints before starting to output, but my guess is that re should do the simplest thing and match codepoint by codepoint, assuming that is what it currently does. I gather that would just mean expanding the current definition of word char. But I would look at TR18 and see what Martin says.
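
A rough sketch of what "expanding the definition of word char" could mean when matching codepoint by codepoint. This is not how the re module is implemented; is_word_codepoint() and split_words() are hypothetical helpers, and the category set (letters, marks, decimal digits, connector punctuation) is an assumption loosely based on the kind of definition TR18 gives, not a copy of it.

```python
import unicodedata

def is_word_codepoint(ch):
    """Hypothetical expanded word-char test, applied to a single codepoint."""
    cat = unicodedata.category(ch)
    # Letters (L*), combining marks (M*), decimal digits, connector punctuation.
    return cat[0] in "LM" or cat in ("Nd", "Pc")

def split_words(text):
    """Collect maximal runs of word codepoints, one codepoint at a time."""
    words, current = [], []
    for ch in text:
        if is_word_codepoint(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# DEVANAGARI LETTER KA (Lo) followed by VOWEL SIGN AA (Mc).
ka_aa = "\u0915\u093e"
categories = [unicodedata.category(ch) for ch in ka_aa]
print(categories)

# Two Hindi words separated by a space; vowel signs and virama stay attached.
sample = "\u0915\u093e \u0939\u093f\u0928\u094d\u0926\u0940"
words = split_words(sample)
print(words)
```

Because marks count as word codepoints here, a base letter and its following signs end up in the same word without the matcher needing to know where one character ends and the next begins.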

I ask because I'm working on the re module at the moment.


Great. I *think* that the change should be fairly simple.

Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list
