Re: grep dfa bug

Charles Levert Mon, 01 Aug 2005 00:03:39 -0700

* On Monday 2005-08-01 at 09:12:03 +0900, KIMURA Koichi wrote:
> 
> I think I found a bug in the dfa of gawk.


You mean grep?  (Both use a dfa.)


> Situation:
> In a Japanese ShiftJIS locale, half-width katakana in a character class
> does not match appropriately.
> 
> Reproduce:
> set LANG=ja_JP.SJIS
> export LANG
> echo ABCDE | grep '/[A-E]\+/p'
> 
> Actually, A B C D E are half-width katakana characters.
> (data to reproduce is appended at end of this mail (uuencoded SJIS data))
> 
> Result:
> nothing printed.

 
> begin 644 testkana.sh
> M<V5T($Q!3D<]:F%?2E`N4TI)4PIE>'!O<[EMAIL PROTECTED];F]T('!R:6YT"F5C!
> <:&[EMAIL PROTECTED];[EMAIL PROTECTED]"!G<F5P("<O6[$MM5U<*R\G"@``(
> ``
> end

$ hexdump -C testkana.sh
00000000  73 65 74 20 4c 41 4e 47  3d 6a 61 5f 4a 50 2e 53  |set LANG=ja_JP.S|
00000010  4a 49 53 0a 65 78 70 6f  72 74 20 4c 41 4e 47 0a  |JIS.export LANG.|
00000020  23 6e 6f 74 20 70 72 69  6e 74 0a 65 63 68 6f 20  |#not print.echo |
00000030  b1 b2 b3 b4 b5 20 7c 20  67 72 65 70 20 27 2f 5b  |..... | grep '/[|
00000040  b1 2d b5 5d 5c 2b 2f 27  0a                       |.-.]\+/'.|

This shell script has several problems:

   -- it shouldn't be "set LANG=ja_JP.SJIS"
      but just "LANG=ja_JP.SJIS" (better yet,
      use LC_ALL instead to be sure to override
      any other environment variable);

   -- there shouldn't be slashes around the
      regular expression (that being awk or
      sed syntax).

Fixing those two problems, I do get a match
using current CVS grep.
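
For reference, here is a minimal sketch of the script with those
two fixes applied.  It assumes a POSIX sh whose printf accepts
octal escapes; the variable names "kana" and "pattern" are mine
and do not appear in the original script.  The bytes 0xb1 through
0xb5 are taken from the hexdump above.

```shell
#!/bin/sh
# Sketch of testkana.sh with the two fixes applied.
# "kana" and "pattern" are illustrative names, not from the original script.

# Bytes 0xb1..0xb5 (octal 261..265), as shown in the hexdump.
kana=$(printf '\261\262\263\264\265')

# Bracket range with no surrounding slashes: [<b1>-<b5>]\+
pattern=$(printf '[\261-\265]\\+')

# Plain VAR=value assignment (no "set"), scoped to grep alone.
# On failure, report the exit status: 1 is "no match", 2 is an error.
printf '%s\n' "$kana" | LC_ALL=ja_JP.SJIS grep "$pattern" ||
    echo "grep exit status: $?"
```

The ja_JP.SJIS locale has to be installed for this to exercise
the multibyte code path ("locale -a" lists what is available);
if it is not, grep will typically fall back to the C locale.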

However, using a more recent version of
regex.c et al. (as recently discussed on the
mailing list), I get a "grep: Invalid collation
character" error with an exit code of 2.

Repeating an equivalent experiment with UTF-8, it
works fine no matter what version of grep I use:

   $ echo 'ｱｲｳｴｵ' | LC_ALL=ja_JP.utf8 grep '[ｱ-ｵ]\+'
   ｱｲｳｴｵ

Strangely, this

   $ echo 'ｱｲｳｴｵ' | LC_ALL=en_US.utf8 grep '[ｱ-ｵ]\+'

only works with the recent regex.c and produces
the same error as above without it.
(I.e., just the opposite of what happens with ja_JP.SJIS.)

Is any UTF-8 locale supposed to know about the
collation order of languages other than its
main one (here en_US about ja_JP)?
