Hi Jim, first of all, thanks for your review.
On Aug 16 07:42, Jim Meyering wrote:
> Hi Corina,
>
> Thanks a lot for the patcb. It is almost perfect.
>
> [ - the git one-line summary should be readable.
> - comment nit: s/ as/ as a/
> - a style issue: we want curly braces around the 1-line
> else body in the first #ifdef block
> - please attribute the reporter (or a list URL) in the commit log
> ]
I attached a new patch which hopefully fixes these points. It also
contains a suggestion for a NEWS entry.
> Do any of the existing tests trigger this malfunction?
Unfortunately not. The testsuite runs fine except for a few failures
in fmbtest, but they are not related to this problem (...but I didn't
took another look into them, either)
> If not, can you create a small example that triggers the
> problem on cygwin? Even better would be the addition of a new
> script in tests/, which is required for any bug-fix patch.
I'm not quite sure how the test script as a whole should look like, but
the test content is very simple; just create input which contains a
valid 4 byte UTF-8 character. Without the patch, grep SEGVs on Cygwin,
with the patch it runs fine:
Before:
$ printf "\xf0\x90\x90\x85\n" | ./grep -i blurb
Segmentation fault (core dumped)
$
After:
$ printf "\xf0\x90\x90\x85\n" | ./grep -i blurb
$
However, I just noticed that it's still not possible to grep for such
a character. Apparently the regex compiler does not recognize UTF-16
surrogate pairs either. Grep fails, grep -F works:
$ printf "\xf0\x90\x90\x85\n" | grep '𐐅'
$ printf "\xf0\x90\x90\x85\n" | grep -F '𐐅'
𐐅
Not sure if that's such a terrible restriction, though...
> Also, it'd be great if you would add a NEWS entry that
> describes your fix.
My suggestion for a NEWS entry is part of the attached patch.
> That said, there's no pressure.
> If you can tell me how to reproduce the failure, I'll
> make time to write both the test and NEWS addition, and
> amend them onto your patch.
>
> PS. Your timing is great. I'm planning to make a release pretty soon.
Sounds good :)
Thanks,
Corinna
--
Corinna Vinschen
Cygwin Maintainer
Red Hat
From 9d1fe22b57ee1a7b80c4e44f4fef786fbee77575 Mon Sep 17 00:00:00 2001 From: Corinna Vinschen <[email protected]> Date: Fri, 16 Aug 2013 20:50:12 +0200 Subject: [PATCH] fix Cygwin UTF-16 surrogate-pair handling with -i Reported by: Jim Burwell Patch by: Corinna Vinschen Signed-off-by: Corinna Vinschen <[email protected]> --- NEWS | 6 ++++++ src/searchutils.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 63 insertions(+), 1 deletion(-) diff --git a/NEWS b/NEWS index 8796a1b..85081c3 100644 --- a/NEWS +++ b/NEWS @@ -4,6 +4,12 @@ GNU grep NEWS -*- outline -*- ** Bug fixes + grep -i would segfault on systems using UTF-16 based wchar_t (Cygwin) + when it performs lowercasing of the input string, and the input string + contains valid 4 byte UTF-8 sequences. The conversion to wchar_t + and back to a UTF-8 multibyte string didn't take surrogate pairs into + account. + grep -E would segfault when given a regexp like '([^.]*[M]){1,2}' for any multibyte character M. [bug introduced in grep-2.6, which would segfault, but 2.7 and 2.8 had no problem, and 2.9 through 2.14 would diff --git a/src/searchutils.c b/src/searchutils.c index 160f80d..cb422ae 100644 --- a/src/searchutils.c +++ b/src/searchutils.c @@ -111,6 +111,37 @@ mbtolower (const char *beg, size_t *n, mb_len_map_t **len_map_p) { wchar_t wc; size_t mbclen = mbrtowc (&wc, beg, end - beg, &is); +#ifdef __CYGWIN__ + /* Handle a UTF-8 sequence for a character beyond the base plane. + Cygwin's wchar_t is UTF-16, as in the underlying OS. This + results in surrogate pairs which need some extra attention. */ + wint_t wci = 0; + if (mbclen == 3 && (wc & 0xdc00) == 0xd800) + { + /* We got the start of a 4 byte UTF-8 sequence. This is returned + as a UTF-16 surrogate pair. The first call to mbrtowc returned 3 + and wc has been set to a high surrogate value, now we're going + to fetch the matching low surrogate. This second call to mbrtowc + is supposed to return 1 to complete the 4 byte UTF-8 sequence. */ + wchar_t wc_2; + size_t mbclen_2 = mbrtowc (&wc_2, beg + mbclen, end - beg - mbclen, + &is); + if (mbclen_2 == 1 && (wc_2 & 0xdc00) == 0xdc00) + { + /* Match. Convert this to a 4 byte wint_t which constitutes + a 32 bit UTF-32 value. */ + wci = ( (((wint_t) (wc - 0xd800)) << 10) + | ((wint_t) (wc_2 - 0xdc00))) + + 0x10000; + ++mbclen; + } + else + { + /* Invalid UTF-8 sequence. */ + mbclen = (size_t) -1; + } + } +#endif if (outlen + mb_cur_max >= outalloc) { size_t dm = m - len_map; @@ -132,8 +163,33 @@ mbtolower (const char *beg, size_t *n, mb_len_map_t **len_map_p) } else { + size_t ombclen; beg += mbclen; - size_t ombclen = wcrtomb (p, towlower ((wint_t) wc), &os); +#ifdef __CYGWIN__ + /* Handle Unicode characters beyond the base plane. */ + if (mbclen == 4) + { + /* towlower, taking wint_t (4 bytes), handles UCS-4 values. */ + wci = towlower (wci); + if (wci >= 0x10000) + { + wci -= 0x10000; + wc = (wci >> 10) | 0xd800; + /* No need to check the return value. When reading the + high surrogate, the return value will be 0 and only the + mbstate indicates that we're in the middle of reading a + surrogate pair. The next wcrtomb call reading the low + surrogate will then return 4 and reset the mbstate. */ + wcrtomb (p, wc, &os); + wc = (wci & 0x3ff) | 0xdc00; + } + else + wc = (wchar_t) wci; + ombclen = wcrtomb (p, wc, &os); + } + else +#endif + ombclen = wcrtomb (p, towlower ((wint_t) wc), &os); *m = mbclen - ombclen; memset (m + 1, 0, ombclen - 1); m += ombclen; -- 1.8.1.4
pgpHrYzIqe4AW.pgp
Description: PGP signature
