MRAB wrote:
Terry Reedy wrote:

I notice from the manual "All identifiers are converted into the normal form NFC while parsing; comparison of identifiers is based on NFC." If NFC uses accented letters, then the issue is finessed away for European words simply because Unicode includes combined characters for European scripts but not for south Asian scripts.

Does that mean that the re module will need to convert both the pattern and the text to be searched into NFC form first?

The quote says that Python 3 internally converts all identifiers in source code to NFC before compiling the code, so it can properly compare them. If this were purely an internal matter, it would not need to be said. I interpret the quote as a warning that a programmer who wants to compare a 3.0 string to an identifier represented as a string is responsible for making sure that *his* string is also in NFC. For instance:

ident = 3
...
if 'ident' in globals(): ...

The second ident must be NFC even if the programmer prefers and habitually writes another form because, like it or not, the first one will be turned into NFC before insertion into the code object and later into globals().
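
A minimal sketch of that pitfall, using a made-up identifier (the word "cafe" with a combining acute accent) written in decomposed form. Current CPython documentation says identifiers are normalized to NFKC rather than NFC; for this particular name the two forms give the same result, so the example should hold either way. The names nfd_name, nfc_name and namespace are only for illustration.

```python
import unicodedata

# Identifier spelled in decomposed form: 'e' followed by COMBINING ACUTE ACCENT.
nfd_name = "caf" + "e\u0301"
# The composed spelling of the same name (single code point U+00E9).
nfc_name = unicodedata.normalize("NFC", nfd_name)

# Compile and run an assignment whose source uses the decomposed spelling.
namespace = {}
exec(nfd_name + " = 3", namespace)

# The parser normalized the identifier, so only the composed key exists.
found_nfd = nfd_name in namespace
found_nfc = nfc_name in namespace
print(found_nfd, found_nfc)
```

So a lookup string should be normalized the same way before it is compared with names in globals().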

So my thought is that re should take the strings as given, but that the re doc should warn about logically equal forms not matching. (Perhaps it does already; I have not read it in years.) If a text uses a different normalization form, which some surely will, the programmer is responsible for using the same in the re pattern.
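
To illustrate what "take the strings as given" means in practice, here is a small sketch with an invented sample text. The pattern is in NFC and the text in NFD, so they are logically equal but differ at the code point level; normalizing the text with unicodedata.normalize before searching is one way for the programmer to line them up.

```python
import re
import unicodedata

# Pattern in composed form (NFC): "café" with U+00E9.
pattern_nfc = "caf\u00e9"
# Text in decomposed form (NFD): 'e' followed by U+0301.
text_nfd = unicodedata.normalize("NFD", "caf\u00e9 au lait")

# re compares code points, so the NFC pattern does not match the NFD text.
no_match = re.search(pattern_nfc, text_nfd)

# Normalizing the text to the pattern's form makes the search succeed.
text_nfc = unicodedata.normalize("NFC", text_nfd)
match = re.search(pattern_nfc, text_nfc)
print(no_match, match)
```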

And I'm still not clear whether \w, when used on a string consisting of Lo followed by Mc, should match Lo and then Mc (one codepoint at a time) or together (one character at a time, where a character consists of some base character codepoint possibly followed by modifier codepoints).

Programs that transform text to glyphs may have to read bundles of codepoints before starting to output, but my guess is that re should do the simplest thing and match codepoint by codepoint, assuming that is what it currently does. I gather that would just mean expanding the current definition of word char. But I would look at TR18 and see what Martin says.
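
As a rough sketch of what "expanding the definition of word char" while still matching codepoint by codepoint could look like: the helper is_word_codepoint below is hypothetical and is not how the re module is implemented. It simply treats letters, marks, decimal digits and underscore as word characters, so a Lo codepoint followed by an Mc codepoint stays in one run. The sample uses DEVANAGARI LETTER KA (U+0915, Lo) and DEVANAGARI VOWEL SIGN AA (U+093E, Mc).

```python
import unicodedata

def is_word_codepoint(ch):
    """Hypothetical word test: letters (L*), marks (M*), decimal digits (Nd), underscore."""
    cat = unicodedata.category(ch)
    return ch == "_" or cat[0] in ("L", "M") or cat == "Nd"

def word_runs(text):
    """Split text into maximal runs of word codepoints, one codepoint at a time."""
    runs, current = [], []
    for ch in text:
        if is_word_codepoint(ch):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs

# KA + vowel sign AA, a space, then KA alone.
sample = "\u0915\u093e \u0915"
categories = [unicodedata.category(ch) for ch in sample]
runs = word_runs(sample)
print(categories, runs)
```

Each codepoint is tested on its own; the Mc codepoint joins the run only because the word test accepts marks, not because the scanner groups it with its base character.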

I ask because I'm working on the re module at the moment.

Great. I *think* that the change should be fairly simple.

Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list
