Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)

yury.t Sat, 24 Aug 2019 07:39:37 -0700

Although this thread may be off-topic by now, let me send a follow-up.
Searching with C-related terms, I found some articles about this issue.  It 
seems to be a common problem with regex + multibyte strings in C.  (e.g. 
https://stackoverflow.com/a/15895746)


On Wed, Aug 21, 2019 at 12:58:04PM +0000, tpt...@tuta.io wrote:
> - [１] (U+FF11) is treated as [\x{F000}-\x{FFFF}]

Actually, it becomes [\xef\xbc\x91], i.e. a set of three single bytes.  That's 
why it matches U+Fxxx (which starts with \xef in UTF-8).  And without ^, it 
matches a single byte inside a character, for example U+4444 (\xe4\x91\x84) or 
U+5C11 (\xe5\xb0\x91).
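
To make the byte-level behaviour concrete, here is a minimal sketch.  It uses 
Python's re module on bytes as a stand-in for a byte-oriented C regex engine; 
that substitution is my assumption, this is not notmuch's code, and the helper 
names are made up for illustration.

```python
import re

# Byte-wise view: U+FF11 encodes to three bytes in UTF-8, so a bracket
# expression built from it is a set of three independent bytes.
byte_class = re.compile(b"[" + "\uff11".encode("utf-8") + b"]")

# Character-wise view: the same bracket expression over code points.
char_class = re.compile("[\uff11]")


def matches_bytewise(ch: str) -> bool:
    """True if any single byte of ch's UTF-8 form is in the byte set."""
    return byte_class.search(ch.encode("utf-8")) is not None


def matches_charwise(ch: str) -> bool:
    """True only if ch itself is U+FF11."""
    return char_class.search(ch) is not None


if __name__ == "__main__":
    for ch in ["\uff11", "\uf000", "\u4444", "\u5c11", "A"]:
        print(
            f"U+{ord(ch):04X}",
            ch.encode("utf-8").hex(),
            "bytewise:", matches_bytewise(ch),
            "charwise:", matches_charwise(ch),
        )
```

In the byte-wise case every sample except "A" matches, because each one 
contains at least one of the bytes ef, bc or 91.  In the character-wise case 
only U+FF11 itself matches.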

I'm not familiar with C and don't know whether pcre or \k solves this issue, 
but it might be hard to fix if the root cause is how C handles multibyte 
strings.
_______________________________________________
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch
