On Wednesday, July 19, 2017 at 1:57:47 AM UTC-5, Steven D'Aprano wrote:
> On Wed, 19 Jul 2017 17:51:49 +1200, Gregory Ewing wrote:
>
> > Chris Angelico wrote:
> >> Once you NFC or NFD normalize both strings, identical strings will
> >> generally have identical codepoints... You should then be able to use
> >> normal regular expressions to match correctly.
> >
> > Except that if you want to match a set of characters,
> > you can't reliably use [...], you would have to write them out as
> > alternatives in case some of them take up more than one code point.
>
> Good point!
>
> A quibble -- there's no "in case" here, since you, the
> programmer, will always know whether they have a single
> code point form or not. If you're unsure, look it up, or
> call unicodedata.normalize().
>
> (Yeah, right, like the average coder will remember to do this...)
>
> Nevertheless, although it might be annoying and tricky,
> regexes *are* flexible enough to deal with this problem.
> After all, you can't use [th] to match "th" as a unit
> either, and regex set character set notation [abcd] is
> logically equivalent to (a|b|c|d).
If the intention is to match the two-character string "th", then the
obvious solution is to wrap the substring in a capturing or
non-capturing group:

    pattern = r'(?:th)'

Though I suppose one could abuse the character-set syntax by doing
something like:

    pattern = r'[t][h]'

However, even the first example (using a group) is superfluous if "th"
is the only substring to be matched. Grouping is only necessary in more
complex patterns.
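For anyone who wants to poke at this in the interpreter, here is a rough sketch of the points above. It assumes Python 3 and the standard re and unicodedata modules. The alternation() helper and the variable names are made up for illustration and are not part of any library.

```python
import re
import unicodedata

# A character set matches exactly one character from the set.
set_hits = re.findall(r'[th]', 'the')          # 't' and 'h' separately

# A bare literal already matches "th" as a unit; the group adds nothing here.
literal_hits = re.findall(r'th', 'the thing')
group_hits = re.findall(r'(?:th)', 'the thing')

# The group earns its keep once a quantifier must apply to the whole unit.
repeated_unit = re.fullmatch(r'(?:th)+', 'ththth')   # matches
repeated_h = re.fullmatch(r'th+', 'ththth')          # None: + binds to "h" only


def alternation(units):
    """Hypothetical helper: build (?:a|b|...) from strings of any length."""
    # Longest first, so a multi-code-point unit wins over its own prefix.
    ordered = sorted(units, key=len, reverse=True)
    return '(?:' + '|'.join(re.escape(u) for u in ordered) + ')'


# After NFD normalization, e-acute is "e" + COMBINING ACUTE ACCENT (U+0301),
# i.e. two code points.
text = unicodedata.normalize('NFD', 'caf\u00e9')
e_acute = unicodedata.normalize('NFD', '\u00e9')

# Inside [...] the pair is treated as two independent set members.
class_hits = re.findall('[a' + e_acute + ']', text)

# Written out as alternatives, the pair is matched as one unit.
alt_hits = re.findall(alternation(['a', e_acute]), text)
```

The last two lines are the case Gregory raised: the set splits the decomposed character into a base letter and a stray combining mark, while the alternation keeps them together.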