Forwarded from someone who has encountered problems posting on
linux-utf8 (who maintains the list server?):
------- Forwarded Message
I wasn't able to post to linux-utf8. Did you guys
receive my message? If I implemented a fix would
you be interested? I haven't yet...
Please let me know if you are interested in getting
this situation fixed.
Brad
From: "Chen, Brad" <[EMAIL PROTECTED]>
Sent: Saturday, June 05, 2004 4:04 PM
To: '[EMAIL PROTECTED]'
Subject: RE: grep is horriby slow in UTF-8 locales
From the proposed patch:
-      if (MB_CUR_MAX > 1 && mb_properties[beg - buf] == 0)
-        continue;
+      if (MB_CUR_MAX > 1)
+        {
+          memset(&cur_state, 0, sizeof(mbstate_t));
+          if (mbrlen(beg + offset, buf + size - beg, &cur_state) < 0)
+            continue; /* It is a part of multibyte character. */
+        }
This code does not appear to be functionally equivalent to what
was there before. In the old version, mb_properties[i] would be
0 only if the byte in question was part of a multi-byte character
and not the first byte. For the cases where the new code reaches
"continue", the original code would have had mb_properties[i] == 1
and would not have behaved the same way.
Am I misreading this code?
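For reference, here is a minimal sketch of the bookkeeping described above.
It is not the grep source. The name mark_char_starts is hypothetical, and
storing the character length in the first byte's slot is an assumption.
The two points taken from the description are that continuation bytes
get 0 and that a byte which does not start a valid character is treated
as a single-byte character and gets 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not taken from grep.
   props[i] is 0 for every byte that continues a multibyte character,
   and nonzero (the character's length) for the byte that starts one.
   Invalid or truncated sequences are treated as single-byte
   characters, so their slot holds 1.  props must have room for
   size entries.  Results depend on the current LC_CTYPE locale. */
static void mark_char_starts(const char *buf, size_t size,
                             unsigned char *props)
{
    mbstate_t state;
    size_t i = 0;

    assert(size == 0 || (buf != NULL && props != NULL));

    memset(&state, 0, sizeof state);
    if (size > 0)
        memset(props, 0, size);

    while (i < size) {
        size_t len = mbrlen(buf + i, size - i, &state);

        /* mbrlen reports errors as (size_t) -1 and (size_t) -2, and
           returns 0 for a null character.  Consume one byte in all
           three cases and start over from the initial shift state. */
        if (len == (size_t) -1 || len == (size_t) -2 || len == 0) {
            len = 1;
            memset(&state, 0, sizeof state);
        }

        props[i] = (unsigned char) len;
        i += len;
    }
}
```

With a table built this way, "props[beg - buf] == 0" is true only in the
middle of a valid multibyte character, which is a different set of
positions from those where mbrlen fails.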
Another thing you might want to tidy up here: mbrlen returns
a size_t, which is unsigned, so the "< 0" comparison can never be
true. The error returns (size_t) -1 and (size_t) -2 have to be
tested for explicitly.
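A sketch of how the mbrlen result could be checked instead. The helper
names mbrlen_failed and starts_valid_char are hypothetical and do not
appear in the patch. The only library call used is mbrlen itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: nonzero if len is one of the two error values
   mbrlen can return.  (size_t) -1 means an invalid sequence and
   (size_t) -2 means an incomplete one.  Both are huge positive
   numbers, so a "< 0" test never catches them. */
static int mbrlen_failed(size_t len)
{
    return len == (size_t) -1 || len == (size_t) -2;
}

/* Hypothetical helper: nonzero if the n bytes at p begin with a
   complete, valid character in the current locale, starting from
   the initial shift state. */
static int starts_valid_char(const char *p, size_t n)
{
    mbstate_t state;

    assert(p != NULL);
    memset(&state, 0, sizeof state);
    return !mbrlen_failed(mbrlen(p, n, &state));
}
```

In the patch, the condition would then read something like
"if (mbrlen_failed(mbrlen(...))) continue;", assuming that skipping on
failure is really the intended behaviour.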
I confess I haven't tested either version of the code for
correctness yet. I was just looking at performance. If you have a
favorite correctness test case, please send it.
Best Wishes,
Brad Chen
Intel Corporation
SSG/Performance Tools Lab
------- End of Forwarded Message
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/