Re: [PATCH 0/2] Set PCRE_UTF8 flag correctly for UTF-8 locales

Jim Meyering Wed, 03 Oct 2012 03:10:46 -0700

Paolo Bonzini wrote:
> This is the patch attached to https://bugzilla.redhat.com/683753
> and http://savannah.gnu.org/patch/?3934, with testcases.
>
> Paolo
>
> Paolo Bonzini (1):
>   tests: include UTF-8 testcases for grep -P
>
> Petr Pisar (1):
>   pcresearch: set UTF-8 flag correctly for UTF-8 locales
>
>  NEWS              |  6 ++++++
>  src/pcresearch.c  |  8 ++++++++
>  tests/Makefile.am |  1 +
>  tests/pcre-utf8   | 33 +++++++++++++++++++++++++++++++++
>  4 files changed, 48 insertions(+)
>  create mode 100755 tests/pcre-utf8


Thanks for the quick work, Paolo.
I will push this follow-on patch shortly, along with one more
to factor out the now-duplicate STREQ definition.

From 9df414a75f101a1f7f25c5850d5cfc2e242f6ff8 Mon Sep 17 00:00:00 2001
From: Jim Meyering <[email protected]>
Date: Wed, 3 Oct 2012 12:08:31 +0200
Subject: [PATCH] maint: correct syntax-check failures; adjust NEWS

* tests/pcre-utf8: Reverse order of compare arguments.
Remove all copyright year numbers except 2012.
Use skip_ "diagnostic...", rather than a bare "exit 77".
* NEWS: Start with a concise description of the bug.
* src/pcresearch.c (STREQ): Define, so that we can...
(Pcompile): use STREQ, not strcmp.
---
 NEWS             |  9 +++++----
 src/pcresearch.c |  4 +++-
 tests/pcre-utf8  | 13 +++++++------
 3 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/NEWS b/NEWS
index bc669b9..052cd81 100644
--- a/NEWS
+++ b/NEWS
@@ -4,10 +4,11 @@ GNU grep NEWS                                    -*- outline -*-

 ** Bug fixes

-  While multi-byte mode is only supported by PCRE with UTF-8 locales,
-  grep did not activate it.  This can cause failures to match multibyte
-  characters against some regular expressions, especially those including
-  the '.' or '\p' metacharacters.
+  grep -P could misbehave.  While multi-byte mode is only supported by PCRE
+  with UTF-8 locales, grep did not activate it.  This would cause failures
+  to match multibyte characters against some regular expressions, especially
+  those including the '.' or '\p' metacharacters.
+

 * Noteworthy changes in release 2.14 (2012-08-20) [stable]

diff --git a/src/pcresearch.c b/src/pcresearch.c
index 3539b58..a15f598 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -29,6 +29,8 @@
 # include <langinfo.h>
 #endif

+#define STREQ(a, b) (strcmp (a, b) == 0)
+
 #if HAVE_LIBPCRE
 /* Compiled internal form of a Perl regular expression.  */
 static pcre *cre;
@@ -55,7 +57,7 @@ Pcompile (char const *pattern, size_t size)
   char const *pnul;

 #if defined HAVE_LANGINFO_CODESET
-  if (!strcmp(nl_langinfo(CODESET), "UTF-8"))
+  if (STREQ (nl_langinfo (CODESET), "UTF-8"))
     flags |= PCRE_UTF8;
 #endif

diff --git a/tests/pcre-utf8 b/tests/pcre-utf8
index b86b114..04146ec 100755
--- a/tests/pcre-utf8
+++ b/tests/pcre-utf8
@@ -1,7 +1,7 @@
 #! /bin/sh
 # Ensure that, with -P, Unicode \p{} symbols are correctly matched.
 #
-# Copyright (C) 2001, 2006, 2009-2012 Free Software Foundation, Inc.
+# Copyright (C) 2012 Free Software Foundation, Inc.
 #
 # Copying and distribution of this file, with or without modification,
 # are permitted in any medium without royalty provided the copyright
@@ -13,21 +13,22 @@ require_en_utf8_locale_

 fail=0

-echo '$' | LC_ALL=en_US.UTF-8 grep -qP '\p{S}' || exit 77
+echo '$' | LC_ALL=en_US.UTF-8 grep -qP '\p{S}' \
+  || skip_ 'PCRE support is compiled out'

 euro='\xe2\x82\xac euro'
 printf "$euro\\n" > in || framework_failure_

 LC_ALL=en_US.UTF-8 grep -P '^\p{S}' in > out || fail=1
-compare out in || fail=1
+compare in out || fail=1

 LC_ALL=en_US.UTF-8 grep -P '^. euro$' in > out2 || fail=1
-compare out2 in || fail=1
+compare in out2 || fail=1

 LC_ALL=en_US.UTF-8 grep -oP '. euro' in > out3 || fail=1
-compare out3 in || fail=1
+compare in out3 || fail=1

 LC_ALL=en_US.UTF-8 grep -P '^\P{S}' in > out4
-compare out4 /dev/null || fail=1
+compare /dev/null out4 || fail=1

 Exit $fail
--
1.7.12.1.382.gb0576a6
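
For readers who want to try the locale check from the src/pcresearch.c
hunk outside of grep, here is a minimal sketch. It is not the grep
source: SKETCH_UTF8_FLAG is a hypothetical stand-in for PCRE_UTF8 so the
code builds without pcre.h, and add_utf8_flag is an illustrative helper
name that does not appear in the patch. Only the STREQ macro is taken
verbatim from the diff.

```c
#include <assert.h>
#include <string.h>

/* Same definition the patch adds to src/pcresearch.c.  */
#define STREQ(a, b) (strcmp (a, b) == 0)

/* Hypothetical stand-in for PCRE_UTF8 (assumption: any single bit
   works for illustration; the real value comes from pcre.h).  */
enum { SKETCH_UTF8_FLAG = 1 };

/* Return FLAGS with the UTF-8 bit set when CODESET is exactly "UTF-8".
   In grep, CODESET would be the result of nl_langinfo (CODESET).
   The comparison is case-sensitive, as in the patch.  */
int
add_utf8_flag (int flags, char const *codeset)
{
  assert (codeset != NULL);
  if (STREQ (codeset, "UTF-8"))
    flags |= SKETCH_UTF8_FLAG;
  return flags;
}
```

A caller would typically run setlocale (LC_ALL, "") first and then pass
nl_langinfo (CODESET) from <langinfo.h> as the second argument. Spelling
the test as STREQ rather than !strcmp keeps the "equal" sense visible at
the call site, which is the point of the change to Pcompile.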
