On Nov 29, 10:51 am, MRAB <[EMAIL PROTECTED]> wrote:
> John Machin wrote:
> > On Nov 29, 2:47 am, Shiao <[EMAIL PROTECTED]> wrote:
> >> The regex below identifies words in all languages I tested, but not in
> >> Hindi:
> >
> >> pat = re.compile('^(\w+)$', re.U)
> >> ...
> >> m = pat.search(l.decode('utf-8'))
> >
> > [example snipped]
> >
> >> From this is assumed that the Hindi text contains punctuation or other
> >> characters that prevent the word match.
> >
> > This appears to be a bug in Python, as others have pointed out. Two
> > points not covered so far:
>
> Well, not so much a bug as a lack of knowledge.

It's a bug. See below.

> > (1) Instead of search() with pattern ^blahblah, use match() with
> > pattern blahblah -- unless it has been fixed fairly recently, search()
> > doesn't notice that the ^ means that it can give up when failure
> > occurs at the first try; it keeps on trying futilely at the 2nd,
> > 3rd, .... positions.
> >
> > (2) "identifies words": \w+ (when fixed) matches a sequence of one or
> > more characters that could appear *anywhere* in a word in any language
> > (including computer languages). So it not only matches words, it also
> > matches non-words like '123' and '0x000' and '0123_' and 10 viramas --
> > in other words, you may need to filter out false positives. Also, in
> > some languages (e.g. Chinese) a "word" consists of one or more
> > characters and there is typically no spacing between "words"; \w+ will
> > identify whole clauses or sentences.
>
> This is down to the definition of "word character".

What is "This"? The two additional points I'm making have nothing to do
with \w.

> Should \w match Mc
> characters? Should \w match a single character or a non-combining
> character with any combining characters, ie just Lo or Lo, Lo+Mc,
> Lo+Mc+Mc, etc?

Huh? I thought it was settled. Read Terry Reedy's latest message. Read
the bug report it points to (http://bugs.python.org/issue1693050),
especially the contribution from MvL. To paraphrase a remark by the
timbot, Martin reads Unicode tech reports so that we don't have to.
However, if you are a doubter or have insomnia, read
http://unicode.org/reports/tr18/

Cheers,
John
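
For anyone following along, points (1) and (2) and the Lo/Mc question can be tried out with a rough sketch like the one below. It assumes Python 3 (where str patterns are Unicode-aware without re.U), not the Python 2 of the original post. The names `categories` and `looks_like_word` are made up for illustration, and the "starts with a letter, then only letters or marks" rule is just one possible false-positive filter, not something the re module provides.

```python
import re
import unicodedata

# Point (1): search() with a leading ^ and match() without it
# find the same match on the same input.
anchored = re.compile(r'^(\w+)$')
unanchored = re.compile(r'(\w+)$')

assert anchored.search('hello').group(1) == unanchored.match('hello').group(1)

# Point (2): \w+ also accepts strings that are not words.
non_words = ['123', '0x000', '0123_']
assert all(unanchored.match(s) for s in non_words)


def categories(text):
    """Return the Unicode general category of each character in text."""
    return [unicodedata.category(ch) for ch in text]


def looks_like_word(token):
    """Hypothetical filter: starts with a letter, then only letters or marks."""
    cats = categories(token)
    return (bool(cats)
            and cats[0].startswith('L')
            and all(c[0] in 'LM' for c in cats))


# The word "Hindi" in Devanagari, written with escapes to keep the code points explicit.
hindi = '\u0939\u093f\u0928\u094d\u0926\u0940'

print(categories(hindi))
# ['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Mc']

# The last two test tokens are ten viramas (U+094D) and the Hindi word itself.
print([looks_like_word(t) for t in non_words + ['\u094d' * 10, hindi]])
# [False, False, False, False, True]
```

The sketch deliberately checks categories with unicodedata instead of relying on \w for the Devanagari text, since whether \w covers Mc and Mn characters is exactly what the bug report is about and may differ between Python versions.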