New submission from Shriramana Sharma <[email protected]>:
Code:
import re
cons_taml = "[கஙசஞடணதநபமயரலவழளறன]"
print(re.findall("\\b" + cons_taml + "ை|ஐ", "ஐவர் பையன் இசை சிவிகை இல்லை இவ்ஐ"))
cons_deva = "[कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह]"
print(re.findall("\\b" + cons_deva + "ै|ऐ", "ऐषमः तैलम् ईडै समीशै ईक्षै ईक्ऐ"))
Specs:
Kubuntu Xenial 64 bit
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Actual Output:
['ஐ', 'பை', 'கை', 'லை', 'ஐ']
['ऐ', 'तै', 'शै', 'षै', 'ऐ']
Expected Output:
['ஐ', 'பை']
['ऐ', 'तै']
Rationale:
The RE is intended to match words *starting* with the vowel /ai/, written
either as a vowel sign (\u0BC8 ை in Tamil script, \u0948 ै in Devanagari) or as
an independent vowel (\u0B90 ஐ, \u0910 ऐ). ஐவர் பையன் and ऐषमः तैलम् are the
only words fitting this criterion. \b is defined to mark a word boundary and is
applied here at the beginning of the RE.
Observation:
There seems to be some assumption that only GC=Lo characters constitute words.
Hence the false positives at ச ி வ ி (க ை) and स म ी (श ै) where the ி and ी
are vowel signs, and இ ல ் (ல ை) and ई क ् (ष ै) where the ் and ् are virama
characters or vowel cancelling signs.
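The categories involved can be checked directly. The following is a minimal sketch using only the standard unicodedata and re modules; the names marks and report are illustrative and not part of the original code. It lists the general category of each mark named above and whether \w matches it on the interpreter at hand.

```python
import re
import unicodedata

# The vowel signs and viramas that trigger the false positives above.
marks = [
    "\u0BBF",  # Tamil vowel sign i
    "\u0BCD",  # Tamil virama
    "\u0940",  # Devanagari vowel sign ii
    "\u094D",  # Devanagari virama
]

# For each mark: code point, general category, and whether \w matches it.
report = [
    (
        "U+{:04X}".format(ord(m)),
        unicodedata.category(m),
        re.fullmatch(r"\w", m) is not None,
    )
    for m in marks
]

for row in report:
    print(row)
```

If the last field is False for every mark, \b will see a boundary between each mark and the letter that follows it, which is consistent with the output reported above.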
In Indic scripts, such GC=Mc and GC=Mn characters are inalienable parts of
words. They should be treated as word characters, and \b should not report a
word boundary between them and an adjacent letter.
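Until \b behaves this way, one possible workaround is to replace it with a negative lookbehind that also rejects a preceding combining mark. This is only a sketch under stated assumptions: the helper word_start and the mark ranges TAML_MARKS and DEVA_MARKS are hypothetical, hand-picked from the Tamil and Devanagari blocks, and are not an re feature. The sketch also groups the alternation with (?:...), because in "\\b" + cons + "ை|ஐ" the \b binds only to the first alternative.

```python
import re

# Assumption: hand-picked ranges covering the combining marks of each block.
TAML_MARKS = "\u0B82\u0BBE-\u0BCD\u0BD7"
DEVA_MARKS = "\u0900-\u0903\u093A-\u094F\u0951-\u0957\u0962\u0963"


def word_start(marks):
    """Hypothetical helper: assert that the previous character is neither
    a \\w character nor one of the given combining marks."""
    return "(?<![\\w" + marks + "])"


cons_taml = "[கஙசஞடணதநபமயரலவழளறன]"
cons_deva = "[कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह]"

# Group the alternation so the assertion applies to both alternatives.
taml_re = word_start(TAML_MARKS) + "(?:" + cons_taml + "ை|ஐ)"
deva_re = word_start(DEVA_MARKS) + "(?:" + cons_deva + "ै|ऐ)"

taml_hits = re.findall(taml_re, "ஐவர் பையன் இசை சிவிகை இல்லை இவ்ஐ")
deva_hits = re.findall(deva_re, "ऐषमः तैलम् ईडै समीशै ईक्षै ईक्ऐ")

print(taml_hits)
print(deva_hits)
```

With these ranges the two calls should yield the lists given under Expected Output. The lookbehind only checks the left edge of a match, so it stands in for \b at the start of a word and not at the end.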
----------
components: Regular Expressions
messages: 307430
nosy: ezio.melotti, jamadagni, mrabarnett
priority: normal
severity: normal
status: open
title: \b reports false-positives in Indic strings involving combining marks
type: behavior
versions: Python 3.5
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue32198>
_______________________________________