regex [X-Z] with non-ascii char returns different results from (X|Y|Z)

yury.t Wed, 21 Aug 2019 06:06:58 -0700

Some regular expressions return incorrect results if the pattern contains 
multibyte characters in square brackets.  The following bracket expression 
also matches subjects that do not start with one of `[１-９]`, so it returns 
more results than the equivalent parenthesized expression.


(Please note that the digits are full-width Unicode characters.)




    notmuch count -- 'subject:"/^[１-９]/"' # 961


    notmuch count -- 'subject:"/^(１|２|３|４|５|６|７|８|９)/"' # 32





Somehow a non-ASCII character in brackets matches any character whose code 
point starts with the same hex digit.  For example:





- [１] (U+FF11) is treated as [\x{F000}-\x{FFFF}]


- ^[倀] (U+5000), ^[啕] (U+5555) and ^[忿] (U+5FFF) return the same results, 
since they are all "U+5xxx".


Without ^, their results vary but still contain unrelated subjects.
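
One possible explanation, which is only my assumption and not something I 
have verified in the notmuch source, is that the bracket expression is 
compiled against UTF-8 bytes rather than code points.  A three-byte character 
in brackets would then become a set of three separate bytes, and the first of 
them is shared by every character with the same leading hex digit.  The sketch 
below uses Python's `re` module on byte strings to imitate that behavior; the 
helper name `first_byte_matches` is made up for illustration.

```python
import re

def first_byte_matches(bracket: str, subject: str) -> bool:
    """Apply ^[bracket] to the subject's UTF-8 bytes, not its code points."""
    pattern = b"^[" + bracket.encode("utf-8") + b"]"
    return re.search(pattern, subject.encode("utf-8")) is not None

# U+5000, U+5555 and U+5FFF all encode with the same lead byte, 0xE5.
lead_bytes = {ch: ch.encode("utf-8")[0] for ch in "倀啕忿"}

# In byte mode [倀] is the set {0xE5, 0x80}, so ^[倀] also accepts 啕 and 忿.
same_lead = first_byte_matches("倀", "啕") and first_byte_matches("倀", "忿")

# [１] is U+FF11 (bytes EF BC 91), so ^[１] accepts any subject whose first
# byte is 0xEF, for example the full-width yen sign U+FFE5.
yen_matches = first_byte_matches("１", "\uffe5100")

# A code-point-aware match of the same pattern rejects that subject.
unicode_rejects = re.search("^[１]", "\uffe5100") is None
```

Under this assumption the set also contains continuation bytes such as 0x80, 
which might be why the results without ^ differ from each other while still 
including unrelated subjects.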





Curly brackets for repetition also behave strangely.


Suppose there are two emails whose subjects are (A) "１人" and (B) "１２人":



- ^(１|２...|９)人 - matches A, does not match B (expected)


- ^(１|２...|９){2}人 - does not match A, matches B (expected)


- ^[１-９]人 and ^[１-９]{2}人 - match neither


- ^[１-９]{3}人, and likewise {4} and {5} - match A, do not match B


- ^[１-９]{6}人, and likewise {7} and {8} - do not match A, match B
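
Part of this table is consistent with the repetition count being applied to 
UTF-8 bytes instead of characters, although that is a guess on my side and 
not confirmed.  Each full-width digit is three bytes long, so {3} would 
consume one digit and {6} two.  The sketch below imitates byte-wise matching 
with Python's `re` module (the helper `byte_search` is hypothetical).  It 
reproduces the rows for no repetition, {2}, {3} and {6}; it does not 
reproduce the {4}, {5}, {7} and {8} results, so it is at best a partial 
explanation.

```python
import re

A = "１人"    # subject of email (A)
B = "１２人"  # subject of email (B)

def byte_search(pattern: str, subject: str) -> bool:
    """Match the UTF-8 bytes of the pattern against the UTF-8 bytes of the subject."""
    return re.search(pattern.encode("utf-8"), subject.encode("utf-8")) is not None

# In byte mode [１-９] is read as EF, BC, the range 91-EF, BC, 99,
# which is effectively "any byte from 0x91 to 0xEF".
# Map each repetition count to (matches A, matches B).
results = {
    n: (byte_search("^[１-９]{%d}人" % n, A),
        byte_search("^[１-９]{%d}人" % n, B))
    for n in (1, 2, 3, 6)
}
```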





As noted in the notmuch-search-terms manpage, I made sure to wrap the regular 
expression in double quotes and the entire query in single quotes.  I also 
tried increasing and decreasing $XAPIAN_CJK_NGRAM and rebuilding the index, 
but the results did not change.





_______________________________________________
notmuch mailing list
[email protected]
https://notmuchmail.org/mailman/listinfo/notmuch
