On 14/08/14 at 14:33, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >On input, using null bytes may be better if one wants to be able to
> >match real replacement characters without false positives.
> 
> Maybe, though this is no place to get fancy.  It's simple to tell users "an
> invalid byte acts like '?'".  Simple is good.
> 
> Anyway, this is a matter for the implementing volunteer to decide, whoever
> that happens to be.
> 

Workaround attached. It's too slow against binary files, but I haven't
found a simpler solution.
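
To show the idea outside of grep, here is a minimal standalone sketch of
the copy-and-clean loop. It does not call libpcre: first_invalid_utf8 and
clean_utf8_copy are hypothetical names used only for illustration, and the
validator is a simplified stand-in for the offset that pcre_exec reports in
sub[0] (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the offset pcre_exec stores in sub[0] on
   PCRE_ERROR_BADUTF8: return the offset of the first byte that does not
   start a well-formed UTF-8 sequence, or -1 if the buffer validates.
   Simplified: overlong forms and surrogates are not rejected.  */
static long
first_invalid_utf8 (unsigned char const *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char c = s[i];
      /* Number of continuation bytes expected after the lead byte.  */
      size_t need = (c < 0x80 ? 0
                     : (c & 0xE0) == 0xC0 ? 1
                     : (c & 0xF0) == 0xE0 ? 2
                     : (c & 0xF8) == 0xF0 ? 3 : (size_t) -1);
      if (need == (size_t) -1 || len - i - 1 < need)
        return (long) i;
      for (size_t k = 1; k <= need; k++)
        if ((s[i + k] & 0xC0) != 0x80)
          return (long) i;
      i += need + 1;
    }
  return -1;
}

/* Copy LINE and overwrite each reported invalid byte with '\0' until the
   copy validates.  Store the number of overwritten bytes in *REPLACED.
   The caller frees the result.  */
static char *
clean_utf8_copy (char const *line, size_t len, size_t *replaced)
{
  assert (line != NULL);
  char *copy = malloc (len ? len : 1);
  if (!copy)
    abort ();
  memcpy (copy, line, len);
  *replaced = 0;

  long bad;
  /* Every retry rescans the copy from its start, as the patch does.  */
  while ((bad = first_invalid_utf8 ((unsigned char const *) copy, len)) >= 0)
    {
      copy[bad] = '\0';
      ++*replaced;
    }
  return copy;
}
```

Each retry rescans the buffer from the beginning, so the cost grows with
the number of invalid bytes. That is probably part of why the workaround
is slow against binary files.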

What do you think?

Santiago
From 7dd8d7c8682ee29bcb0ec9a64b98170fb7c6a064 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Santiago=20Ruano=20Rinc=C3=B3n?= <santi...@debian.org>
Date: Sat, 16 Aug 2014 14:24:43 +0200
Subject: [PATCH] Workaround to avoid aborting on invalid UTF-8 input

* src/pcresearch.c (Pexecute): When pcre_exec returns an invalid
UTF-8 character error, copy line_buf to an auxiliary buffer,
replace the invalid bytes with null bytes and match against the copy.
* tests/pcre-infloop: Exit status is 1 again.
* tests/pcre-invalid-utf8-input: Check again that grep doesn't
abort.  Also check for a match after a second invalid character
in the same line.

Closes http://debbugs.gnu.org/18266
---
 src/pcresearch.c              |   16 ++++++++++++++++
 tests/pcre-infloop            |    2 +-
 tests/pcre-invalid-utf8-input |   12 +++++++++---
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/src/pcresearch.c b/src/pcresearch.c
index 820dd00..2b81e2b 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -164,6 +164,22 @@ Pexecute (char const *buf, size_t size, size_t *match_size,
       e = pcre_exec (cre, extra, line_buf, line_end - line_buf,
                      start_ofs < 0 ? 0 : start_ofs, 0,
                      sub, sizeof sub / sizeof *sub);
+
+      /* Workaround to avoid aborting on invalid multi-byte input (until
+         libpcre provides a better solution?).
+         If pcre_exec returns PCRE_ERROR_BADUTF8, copy the input, clean it
+         and evaluate again.  */
+      if (e == PCRE_ERROR_BADUTF8){
+        char *line_utf8_clean = xmemdup (line_buf, line_end - line_buf);
+
+        while (e == PCRE_ERROR_BADUTF8) {
+          line_utf8_clean[sub[0]] = '\0';
+          e = pcre_exec (cre, extra, line_utf8_clean, line_end - line_buf,
+                         start_ofs < 0 ? 0 : start_ofs, 0,
+                         sub, sizeof sub / sizeof *sub);
+        }
+        free (line_utf8_clean);
+      }
     }
 
   if (e <= 0)
diff --git a/tests/pcre-infloop b/tests/pcre-infloop
index 1b33e72..b92f8e1 100755
--- a/tests/pcre-infloop
+++ b/tests/pcre-infloop
@@ -28,6 +28,6 @@ printf 'a\201b\r' > in || framework_failure_
 fail=0
 
 LC_ALL=en_US.UTF-8 timeout 3 grep -P 'a.?..b' in
-test $? = 2 || fail_ "libpcre's match function appears to infloop"
+test $? = 1 || fail_ "libpcre's match function appears to infloop"
 
 Exit $fail
diff --git a/tests/pcre-invalid-utf8-input b/tests/pcre-invalid-utf8-input
index 913e8ee..2c6aadb 100755
--- a/tests/pcre-invalid-utf8-input
+++ b/tests/pcre-invalid-utf8-input
@@ -13,9 +13,15 @@ require_en_utf8_locale_
 
 fail=0
 
-printf 'j\202\nj\n' > in || framework_failure_
+printf 'j\202j\202\x\njx\n' > in || framework_failure_
 
-LC_ALL=en_US.UTF-8 grep -P j in
-test $? -eq 2 || fail=1
+LC_ALL=en_US.UTF-8 grep -P j in > out 2> err || fail=1
+compare in out || fail=1
+compare /dev/null err || fail=1
+
+# Match after a second invalid UTF-8 character
+LC_ALL=en_US.UTF-8 grep -P x in > out 2> err || fail=1
+compare in out || fail=1
+compare /dev/null err || fail=1
 
 Exit $fail
-- 
1.7.10.4
