bug#17376: [PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA

Norihiro Tanaka Wed, 30 Apr 2014 15:18:01 -0700

Sorry, tow test cases are wrong.  It's as follows surely.

  encode() { echo "$1" | tr ABC '\357\274\241'; }
  encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
  encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
  encode aABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
         ^
  encode aABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"
         ^


We will expect none of the commands output anything, but we get 1 row in
4th.  We need to fix last line in searchutils.c (is_mb_middle) to make
it correctly.

  return 0 < match_len && match_len < mbrlen (p, end - p, &cur_state);

We must check whether it's valid at not the top but a part of last of
matched pattern.  Now, although checked here: `a$(encode A)', correctly
should be checked here: `a$(encode A)'.        ^
                          ^^^^^^^^^^^
However, it may cause slowdown in some typical cases which doesn't include
any invalid sequences, and many users won't hope it.

Further more, I seem that DFA doesn't treat invalid sequence correctly.
I checked it with debugging on.  No longer tokens are broken in 1st case.

  encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"

  dfaanalyze:
   0:c3 1:af 2:CAT 3:71 4:OR 5:END 6:CAT

I expect below, becuase `encode ABC' is `ef bc a1'.

  dfaanalyze:
    0:ef 1:71 2:OR 3:END 4:CAT

However, It will be also difficult to fix it correctly.  Therefore,
I propose the simple fix in the patch.

Thanks,
Norihiro

bug#17376: [PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA

Reply via email to