* On Monday 2005-08-01 at 09:12:03 +0900, KIMURA Koichi wrote:
>
> I think I found bug of dfa of gawk.

You mean grep?  (Both use a dfa.)

> Situation:
> In Japanese ShiftJIS locale, half-width katakana in character class
> does not match appropriately.
>
> Reproduce:
> set LANG=ja_JP.SJIS
> export LANG
> echo ABCDE | grep '/[A-E]\+/p'
>
> Actually, A B C D E is half-width katakana character.
> (data to reproduce is appended at end of this mail (uuencoded SJIS data))
>
> Result:
> nothing printed.
> begin 644 testkana.sh
> M<V5T($Q!3D<]:F%?2E`N4TI)4PIE>'!O<[EMAIL PROTECTED];F]T('!R:6YT"F5C!
> <:&[EMAIL PROTECTED];[EMAIL PROTECTED]"!G<F5P("<O6[$MM5U<*R\G"@``(
> ``
> end

$ hexdump -C testkana.sh
00000000  73 65 74 20 4c 41 4e 47  3d 6a 61 5f 4a 50 2e 53  |set LANG=ja_JP.S|
00000010  4a 49 53 0a 65 78 70 6f  72 74 20 4c 41 4e 47 0a  |JIS.export LANG.|
00000020  23 6e 6f 74 20 70 72 69  6e 74 0a 65 63 68 6f 20  |#not print.echo |
00000030  b1 b2 b3 b4 b5 20 7c 20  67 72 65 70 20 27 2f 5b  |..... | grep '/[|
00000040  b1 2d b5 5d 5c 2b 2f 27  0a                       |.-.]\+/'.|

This shell script has two problems:

-- it shouldn't be "set LANG=ja_JP.SJIS" but just "LANG=ja_JP.SJIS"
   (better yet, use LC_ALL instead to be sure to override any other
   environment variable);

-- there shouldn't be slashes around the regular expression (that
   being awk or sed syntax).

With those two problems fixed, I do get a match using current CVS
grep.  However, using a more recent version of regex.c et al. (as
recently discussed on the mailing list), I get a "grep: Invalid
collation character" error with an exit code of 2.

An equivalent experiment with UTF-8 works fine no matter which
version of grep I use:

$ echo 'アイウエオ' | LC_ALL=ja_JP.utf8 grep '[ア-オ]\+'
アイウエオ

Strangely, this

$ echo 'アイウエオ' | LC_ALL=en_US.utf8 grep '[ア-オ]\+'

only works with the recent regex.c and produces the same error as
above without it.  (I.e., just the opposite of what happens with
ja_JP.SJIS.)

Is any UTF-8 locale supposed to know about the collation order of
languages other than its main one (here en_US about ja_JP)?
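
For anyone who wants to repeat the ja_JP.SJIS experiment, here is a rough sketch of the script with the two fixes above applied. It is not the original poster's file: the variable names (kana, first, last, status) are made up for illustration, and the bytes 0xb1..0xb5 are written as octal printf escapes (taken from the hexdump) so that the script itself stays plain ASCII. It assumes a ja_JP.SJIS locale is installed; whether grep prints the line or fails with exit code 2 depends on the grep/regex.c version, as described above.

```shell
#!/bin/sh
# Sketch of the corrected test. Assumes the ja_JP.SJIS locale exists on this
# system; if it does not, the programs typically fall back to the C locale.
LC_ALL=ja_JP.SJIS
export LC_ALL

# Bytes 0xb1..0xb5 (octal 261..265), as shown in the hexdump of testkana.sh.
kana=$(printf '\261\262\263\264\265')
first=$(printf '\261')
last=$(printf '\265')

# No slashes around the pattern: grep takes a bare regular expression.
status=0
printf '%s\n' "$kana" | grep "[${first}-${last}]\\+" || status=$?

# 0 = match, 1 = no match, 2 = error (e.g. "Invalid collation character").
echo "grep exit status: $status"
```

Setting LC_ALL rather than LANG means that any LC_COLLATE or LC_CTYPE already present in the environment cannot override the locale under test.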