Re: UTF-16 surrogate pair handling in grep -i option

Corinna Vinschen Fri, 16 Aug 2013 12:01:28 -0700

Hi Jim,

first of all, thanks for your review.

On Aug 16 07:42, Jim Meyering wrote:
> Hi Corina,
> 
> Thanks a lot for the patcb.  It is almost perfect.
> 
> [ - the git one-line summary should be readable.
>   - comment nit: s/  as/  as a/
>   - a style issue: we want curly braces around the 1-line
>   else body in the first #ifdef block
>   - please attribute the reporter (or a list URL) in the commit log
> ]

I attached a new patch which hopefully fixes these points.  It also
contains a suggestion for a NEWS entry.

> Do any of the existing tests trigger this malfunction?

Unfortunately not.  The testsuite runs fine except for a few failures
in fmbtest, but they are not related to this problem (...but I didn't
took another look into them, either)

> If not, can you create a small example that triggers the
> problem on cygwin?  Even better would be the addition of a new
> script in tests/, which is required for any bug-fix patch.

I'm not quite sure how the test script as a whole should look like, but
the test content is very simple;  just create input which contains a
valid 4 byte UTF-8 character.  Without the patch, grep SEGVs on Cygwin,
with the patch it runs fine:

  Before:

    $ printf "\xf0\x90\x90\x85\n" | ./grep -i blurb
    Segmentation fault (core dumped)
    $

  After:

    $ printf "\xf0\x90\x90\x85\n" | ./grep -i blurb
    $

However, I just noticed that it's still not possible to grep for such
a character.  Apparently the regex compiler does not recognize UTF-16
surrogate pairs either.  Grep fails, grep -F works:

    $ printf "\xf0\x90\x90\x85\n" | grep '𐐅'
    $ printf "\xf0\x90\x90\x85\n" | grep -F '𐐅'
    𐐅

Not sure if that's such a terrible restriction, though...

> Also, it'd be great if you would add a NEWS entry that
> describes your fix.

My suggestion for a NEWS entry is part of the attached patch.

> That said, there's no pressure.
> If you can tell me how to reproduce the failure, I'll
> make time to write both the test and NEWS addition, and
> amend them onto your patch.
> 
> PS. Your timing is great. I'm planning to make a release pretty soon.

Sounds good :)

Thanks,
Corinna

-- 
Corinna Vinschen
Cygwin Maintainer
Red Hat

From 9d1fe22b57ee1a7b80c4e44f4fef786fbee77575 Mon Sep 17 00:00:00 2001
From: Corinna Vinschen <[email protected]>
Date: Fri, 16 Aug 2013 20:50:12 +0200
Subject: [PATCH] fix Cygwin UTF-16 surrogate-pair handling with -i

  Reported by: Jim Burwell
  Patch by:    Corinna Vinschen

Signed-off-by: Corinna Vinschen <[email protected]>
---
 NEWS              |  6 ++++++
 src/searchutils.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index 8796a1b..85081c3 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,12 @@ GNU grep NEWS                                    -*- outline 
-*-
 
 ** Bug fixes
 
+  grep -i would segfault on systems using UTF-16 based wchar_t (Cygwin)
+  when it performs lowercasing of the input string, and the input string
+  contains valid 4 byte UTF-8 sequences.  The conversion to wchar_t
+  and back to a UTF-8 multibyte string didn't take surrogate pairs into
+  account.
+
   grep -E would segfault when given a regexp like '([^.]*[M]){1,2}'
   for any multibyte character M. [bug introduced in grep-2.6, which would
   segfault, but 2.7 and 2.8 had no problem, and 2.9 through 2.14 would
diff --git a/src/searchutils.c b/src/searchutils.c
index 160f80d..cb422ae 100644
--- a/src/searchutils.c
+++ b/src/searchutils.c
@@ -111,6 +111,37 @@ mbtolower (const char *beg, size_t *n, mb_len_map_t 
**len_map_p)
     {
       wchar_t wc;
       size_t mbclen = mbrtowc (&wc, beg, end - beg, &is);
+#ifdef __CYGWIN__
+      /* Handle a UTF-8 sequence for a character beyond the base plane.
+         Cygwin's wchar_t is UTF-16, as in the underlying OS.  This
+         results in surrogate pairs which need some extra attention.  */
+      wint_t wci = 0;
+      if (mbclen == 3 && (wc & 0xdc00) == 0xd800)
+        {
+          /* We got the start of a 4 byte UTF-8 sequence.  This is returned
+             as a UTF-16 surrogate pair.  The first call to mbrtowc returned 3
+             and wc has been set to a high surrogate value, now we're going
+             to fetch the matching low surrogate.  This second call to mbrtowc
+             is supposed to return 1 to complete the 4 byte UTF-8 sequence.  */
+          wchar_t wc_2;
+          size_t mbclen_2 = mbrtowc (&wc_2, beg + mbclen, end - beg - mbclen,
+                                     &is);
+          if (mbclen_2 == 1 && (wc_2 & 0xdc00) == 0xdc00)
+            {
+              /* Match.  Convert this to a 4 byte wint_t which constitutes
+                 a 32 bit UTF-32 value.  */
+              wci = ( (((wint_t) (wc - 0xd800)) << 10)
+                     | ((wint_t) (wc_2 - 0xdc00)))
+                    + 0x10000;
+              ++mbclen;
+            }
+          else
+            {
+              /* Invalid UTF-8 sequence.  */
+              mbclen = (size_t) -1;
+            }
+        }
+#endif
       if (outlen + mb_cur_max >= outalloc)
         {
           size_t dm = m - len_map;
@@ -132,8 +163,33 @@ mbtolower (const char *beg, size_t *n, mb_len_map_t 
**len_map_p)
         }
       else
         {
+          size_t ombclen;
           beg += mbclen;
-          size_t ombclen = wcrtomb (p, towlower ((wint_t) wc), &os);
+#ifdef __CYGWIN__
+          /* Handle Unicode characters beyond the base plane.  */
+          if (mbclen == 4)
+            {
+              /* towlower, taking wint_t (4 bytes), handles UCS-4 values.  */
+              wci = towlower (wci);
+              if (wci >= 0x10000)
+                {
+                  wci -= 0x10000;
+                  wc = (wci >> 10) | 0xd800;
+                  /* No need to check the return value.  When reading the
+                     high surrogate, the return value will be 0 and only the
+                     mbstate indicates that we're in the middle of reading a
+                     surrogate pair.  The next wcrtomb call reading the low
+                     surrogate will then return 4 and reset the mbstate.  */
+                  wcrtomb (p, wc, &os);
+                  wc = (wci & 0x3ff) | 0xdc00;
+                }
+              else
+                wc = (wchar_t) wci;
+              ombclen = wcrtomb (p, wc, &os);
+            }
+          else
+#endif
+          ombclen = wcrtomb (p, towlower ((wint_t) wc), &os);
           *m = mbclen - ombclen;
           memset (m + 1, 0, ombclen - 1);
           m += ombclen;
-- 
1.8.1.4

pgpHrYzIqe4AW.pgp
Description: PGP signature

Re: UTF-16 surrogate pair handling in grep -i option

Reply via email to