Sorry, tow test cases are wrong.  It's as follows surely.

  encode() { echo "$1" | tr ABC '\357\274\241'; }
  encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
  encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
  encode aABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
         ^
  encode aABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"
         ^

We will expect none of the commands output anything, but we get 1 row in
4th.  We need to fix last line in searchutils.c (is_mb_middle) to make
it correctly.

  return 0 < match_len && match_len < mbrlen (p, end - p, &cur_state);

We must check whether it's valid at not the top but a part of last of
matched pattern.  Now, although checked here: `a$(encode A)', correctly
should be checked here: `a$(encode A)'.        ^
                          ^^^^^^^^^^^
However, it may cause slowdown in some typical cases which doesn't include
any invalid sequences, and many users won't hope it.

Further more, I seem that DFA doesn't treat invalid sequence correctly.
I checked it with debugging on.  No longer tokens are broken in 1st case.

  encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"

  dfaanalyze:
   0:c3 1:af 2:CAT 3:71 4:OR 5:END 6:CAT

I expect below, becuase `encode ABC' is `ef bc a1'.

  dfaanalyze:
    0:ef 1:71 2:OR 3:END 4:CAT

However, It will be also difficult to fix it correctly.  Therefore,
I propose the simple fix in the patch.

Thanks,
Norihiro




Reply via email to