RE: grep is horriby slow in UTF-8 locales

Markus Kuhn Wed, 09 Jun 2004 07:55:06 -0700

Forwarded from someone who has encountered problems posting on
linux-utf8 (who maintains the list server?):


------- Forwarded Message
I wasn't able to post to linux-utf8. Did you guys
receive my message? If I implemented a fix would
you be interested? I haven't yet...

Please let me know if you are interested in getting
this situation fixed.

Brad 

From: "Chen, Brad" <[EMAIL PROTECTED]>
Sent: Saturday, June 05, 2004 4:04 PM
To: '[EMAIL PROTECTED]'
Subject: RE: grep is horriby slow in UTF-8 locales

From the proposed patch:

-      if (MB_CUR_MAX > 1 && mb_properties[beg - buf] == 0)
-                continue;
+      if (MB_CUR_MAX > 1)
+      {
+        memset(&cur_state, 0, sizeof(mbstate_t));
+          if (mbrlen(beg + offset, buf + size - beg, &cur_state) < 0)
+              continue; /* It is a part of multibyte character.  */
+      }

This code does not appear to be functionally equivalent to what
was there before. In the old version, mb_properties[i] would be
0 only if the byte in question was part of a multi-byte character
and not the first byte. For the cases where the new code reaches
"continue", the original code would have had mb_properties[i] == 1
and would not have behaved the same way.

Am I misreading this code?
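
To make the comparison concrete, here is a minimal sketch of how a
table with the semantics described above could be built. It is not
grep's actual code: the function name mark_char_starts and the choice
to treat invalid or truncated bytes as single-byte characters are
assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not taken from grep: sets props[i] to 1 when
   byte i is a single-byte character or the first byte of a multibyte
   character, and to 0 when it is a trailing byte. */
static void mark_char_starts(const char *buf, size_t size,
                             unsigned char *props)
{
    mbstate_t st;
    size_t i = 0;

    memset(&st, 0, sizeof st);
    memset(props, 0, size);

    while (i < size) {
        size_t n = mbrlen(buf + i, size - i, &st);

        if (n == (size_t)-1 || n == (size_t)-2 || n == 0) {
            /* Invalid sequence, truncated sequence, or NUL byte:
               assume one byte and restart from the initial state. */
            n = 1;
            memset(&st, 0, sizeof st);
        }
        props[i] = 1;   /* a character starts here */
        i += n;         /* trailing bytes keep the value 0 */
    }
}
```

With a table like this, a test of the form props[beg - buf] == 0 skips
only match positions that fall inside a multibyte character.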

Another thing you might want to tidy up here: mbrlen returns
a size_t, which is unsigned, so the "< 0" comparison can never be
true. Errors are reported as (size_t) -1 and (size_t) -2 instead.
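
As a rough sketch of the kind of check that avoids the signed
comparison, the return value of mbrlen can be tested against its two
documented error values. The helper name starts_valid_char is made up
for this example and does not appear in the patch.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if mbrlen accepts the bytes at p as
   a complete character (including NUL), and 0 if it reports an
   invalid or truncated sequence. */
static int starts_valid_char(const char *p, size_t avail)
{
    mbstate_t st;
    size_t n;

    memset(&st, 0, sizeof st);      /* start from the initial state */
    n = mbrlen(p, avail, &st);

    /* mbrlen signals errors as (size_t) -1 (invalid sequence) and
       (size_t) -2 (incomplete sequence); both are large positive
       values, so "n < 0" would never be true. */
    if (n == (size_t)-1 || n == (size_t)-2)
        return 0;
    return 1;
}
```

Whether "continue" should be taken on these error values is a separate
question from the one raised above about equivalence with the old
mb_properties test.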

I confess I haven't tested either version of the code for
correctness yet. I was just looking at performance. If you have a
favorite correctness test case, please send it.

Best Wishes,
Brad Chen
Intel Corporation
SSG/Performance Tools Lab

------- End of Forwarded Message


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
