On Nov 26, 2007 4:27 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > myASCIIRegex = re.compile('[A-Z]')
> > myUniRegex = re.compile(u'\u2013') # en-dash
> >
> > then read the source file into a unicode string with codecs.read(),
> > then expect re to match against the unicode string using either of
> > those regexes if the string contains the relevant chars? Or do I need
> > to make all my regex patterns unicode strings, with u""?
>
> It will work fine if the regular expression restricts itself to ASCII,
> and doesn't rely on any of the locale-specific character classes (such
> as \w). If it's beyond ASCII, or does use such escapes, you better make
> it a Unicode expression.
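In other words, something along these lines should be fine (a rough
sketch, untested; the file name "source.txt" and the utf-8 encoding are
just placeholders for whatever you actually have):

    import codecs
    import re

    myASCIIRegex = re.compile('[A-Z]')   # pure ASCII, no \w / \s / \b
    myUniRegex = re.compile(u'\u2013')   # en-dash, from a unicode literal

    # read the file into a unicode string
    f = codecs.open('source.txt', 'r', 'utf-8')
    text = f.read()
    f.close()

    # both patterns can be applied to the unicode string
    print(myASCIIRegex.search(text) is not None)
    print(myUniRegex.search(text) is not None)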
Yes, you have to be careful when writing Unicode-sensitive regular
expressions:

http://effbot.org/zone/unicode-objects.htm

"You can apply the same pattern to either 8-bit (encoded) or Unicode
strings. To create a regular expression pattern that uses Unicode
character classes for \w (and \s, and \b), use the "(?u)" flag prefix,
or the re.UNICODE flag:

    pattern = re.compile("(?u)pattern")
    pattern = re.compile("pattern", re.UNICODE)"

>
> I'm not actually sure what precisely the semantics is when you match
> an expression compiled from a byte string against a Unicode string,
> or vice versa. I believe it operates on the internal representation,
> so \xf6 in a byte string expression matches with \u00f6 in a Unicode
> string; it won't try to convert one into the other.
>
> Regards,
> Martin
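For what it's worth, here's a quick sketch of both points (Python 2
semantics; the sample string is made up):

    import re

    # sample text: LATIN SMALL LETTER SHARP S and O WITH DIAERESIS
    s = u'stra\u00dfe \u00f6l'

    # plain \w only covers [a-zA-Z0-9_], so the non-ASCII letters
    # split the words:
    print(re.findall(r'\w+', s))       # [u'stra', u'e', u'l']

    # with (?u) / re.UNICODE, \w covers Unicode letters as well:
    print(re.findall(r'(?u)\w+', s))   # [u'stra\xdfe', u'\xf6l']

    # a pattern compiled from the byte string '\xf6', matched against
    # the unicode string: as Martin describes, the engine compares
    # code points directly, so it finds u'\u00f6' without decoding
    print(re.search('\xf6', s) is not None)   # True on Python 2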