On 14/08/14 at 14:33, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >On input, using null bytes may be better if one wants to be able to
> >match real replacement characters without false positives.
>
> Maybe, though this is no place to get fancy.  It's simple to tell users "an
> invalid byte acts like '?'".  Simple is good.
>
> Anyway, this is a matter for the implementing volunteer to decide, whoever
> that happens to be.
>
Workaround attached. It's too slow against binary files, but I haven't
found a simpler solution. What do you think?

Santiago

From 7dd8d7c8682ee29bcb0ec9a64b98170fb7c6a064 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Santiago=20Ruano=20Rinc=C3=B3n?= <santi...@debian.org>
Date: Sat, 16 Aug 2014 14:24:43 +0200
Subject: [PATCH] Workaround to don't abort for invalid UTF8 input

* src/pcresearch.c (Pexecute): When pcre_exec returns an invalid
UTF8 character error, copy line_buf to an auxiliary buffer, remove
invalid characters and evaluate against it.
* tests/pcre-infloop: Exit status is 1 again.
* tests/pcre-invalid-utf8-input: Check again that grep doesn't abort.
Also check for a match after a second invalid character in the same line.

Closes http://debbugs.gnu.org/18266
---
 src/pcresearch.c              | 16 ++++++++++++++++
 tests/pcre-infloop            |  2 +-
 tests/pcre-invalid-utf8-input | 12 +++++++++---
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/src/pcresearch.c b/src/pcresearch.c
index 820dd00..2b81e2b 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -164,6 +164,22 @@ Pexecute (char const *buf, size_t size, size_t *match_size,
           e = pcre_exec (cre, extra, line_buf, line_end - line_buf,
                          start_ofs < 0 ? 0 : start_ofs, 0,
                          sub, sizeof sub / sizeof *sub);
+
+          /* Workaround to don't abort for invalid multi-byte input (until
+             libpcre provides a better solution?)
+             If pcre_exec returns PCRE_ERROR_BADUTF8, copy the input, clean it
+             and evaluate again.  */
+          if (e == PCRE_ERROR_BADUTF8){
+            char *line_utf8_clean = xmemdup (line_buf, line_end - line_buf);
+
+            while (e == PCRE_ERROR_BADUTF8) {
+              line_utf8_clean[sub[0]] = '\0';
+
+              e = pcre_exec (cre, extra, line_utf8_clean, line_end - line_buf,
+                             start_ofs < 0 ? 0 : start_ofs, 0,
+                             sub, sizeof sub / sizeof *sub);
+            }
+          }
         }
 
       if (e <= 0)
diff --git a/tests/pcre-infloop b/tests/pcre-infloop
index 1b33e72..b92f8e1 100755
--- a/tests/pcre-infloop
+++ b/tests/pcre-infloop
@@ -28,6 +28,6 @@ printf 'a\201b\r' > in || framework_failure_
 fail=0
 
 LC_ALL=en_US.UTF-8 timeout 3 grep -P 'a.?..b' in
-test $? = 2 || fail_ "libpcre's match function appears to infloop"
+test $? = 1 || fail_ "libpcre's match function appears to infloop"
 
 Exit $fail
diff --git a/tests/pcre-invalid-utf8-input b/tests/pcre-invalid-utf8-input
index 913e8ee..2c6aadb 100755
--- a/tests/pcre-invalid-utf8-input
+++ b/tests/pcre-invalid-utf8-input
@@ -13,9 +13,15 @@ require_en_utf8_locale_
 
 fail=0
 
-printf 'j\202\nj\n' > in || framework_failure_
+printf 'j\202j\202\x\njx\n' > in || framework_failure_
 
-LC_ALL=en_US.UTF-8 grep -P j in
-test $? -eq 2 || fail=1
+LC_ALL=en_US.UTF-8 grep -P j in > out 2>&1 || fail=1
+compare in out || fail=1
+compare /dev/null err || fail=1
+
+# Match after a second invalid UTF-8 character
+LC_ALL=en_US.UTF-8 grep -P x in > out 2>&1 || fail=1
+compare in out || fail=1
+compare /dev/null err || fail=1
 
 Exit $fail
-- 
1.7.10.4
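
For readers who want to try the retry loop outside grep, here is a minimal standalone sketch in C. It is not the patch itself and does not call libpcre: first_invalid_utf8 is a hypothetical, simplified stand-in for the validity check that pcre_exec performs (it does not reject overlong forms or surrogates), and clean_copy is an illustrative name for the copy-and-retry step. The passes counter is an addition to show how often the buffer is rescanned, which is presumably where the slowness on binary files comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the UTF-8 check done inside pcre_exec.
   Return the offset of the first byte that does not start a well-formed
   UTF-8 sequence, or -1 if the buffer is clean.  NUL bytes are valid.
   Simplified: overlong forms and surrogates are not rejected.  */
long
first_invalid_utf8 (const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char c = s[i];
      size_t need;                /* continuation bytes expected */

      if (c < 0x80)
        need = 0;
      else if (c >= 0xC2 && c <= 0xDF)
        need = 1;
      else if (c >= 0xE0 && c <= 0xEF)
        need = 2;
      else if (c >= 0xF0 && c <= 0xF4)
        need = 3;
      else
        return (long) i;          /* stray continuation or bad lead byte */

      if (need > len - i - 1)
        return (long) i;          /* sequence truncated by end of buffer */
      for (size_t k = 1; k <= need; k++)
        if ((s[i + k] & 0xC0) != 0x80)
          return (long) i;        /* missing continuation byte */
      i += need + 1;
    }
  return -1;
}

/* Copy BUF, then overwrite one invalid byte with NUL per scan,
   restarting the scan from the beginning each time, as the patch
   restarts pcre_exec.  Store the number of scans in *PASSES.
   Return the malloc'd copy (caller frees), or NULL on allocation
   failure.  */
char *
clean_copy (const char *buf, size_t len, size_t *passes)
{
  assert (buf != NULL && passes != NULL);

  char *copy = malloc (len ? len : 1);
  if (copy == NULL)
    return NULL;
  memcpy (copy, buf, len);

  *passes = 0;
  for (;;)
    {
      long bad = first_invalid_utf8 ((const unsigned char *) copy, len);
      ++*passes;
      if (bad < 0)
        break;
      copy[bad] = '\0';
    }
  return copy;
}
```

With the five-byte input "j\202j\202x" (the first line of the test file without its newline, taking the final character to be x), clean_copy needs three scans: one per invalid byte plus a final clean one. In general a buffer with k invalid bytes is scanned k + 1 times from the start, so the cost grows roughly with k times the line length. A single left-to-right cleaning pass before matching would avoid the rescans, at the price of always copying the line.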