Hi Paul,

Thanks for the thorough investigation. I now understand that we cannot tell in general whether the DFA's behavior or regex's behavior is right.
I have tested the behavior of several regex engines. What is interesting
is that most of the results differ from one another, so nobody can say
which one is right.
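As a reading aid for the results below: the byte sequences are UTF-8 encodings, and a minimal Python sketch can decode them. It is not part of test.sh and does not reproduce how any of the engines tested here match. It only shows the code point, the Unicode name, and what Python's own `str.upper()` and `str.lower()` return, which is just one more opinion about case mapping. The helper name `describe` is made up for illustration.

```python
import unicodedata

# Byte sequences that appear in the results, written as hex.
SEQUENCES = ["c7 87", "c7 88", "c7 89", "49", "69", "c4 b0", "c4 b1"]


def to_hex(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


def describe(hex_bytes):
    """Decode one UTF-8 byte sequence and report Python's case mappings."""
    ch = bytes.fromhex(hex_bytes).decode("utf-8")
    return {
        "bytes": hex_bytes,
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        # Python applies full case mappings, so a result may be
        # longer than one character.
        "upper": to_hex(ch.upper()),
        "lower": to_hex(ch.lower()),
    }


for seq in SEQUENCES:
    info = describe(seq)
    print(f'{info["bytes"]:6} {info["code_point"]} {info["name"]}: '
          f'upper={info["upper"]} lower={info["lower"]}')
```

The sequences c7 87, c7 88 and c7 89 decode to U+01C7, U+01C8 and U+01C9 (the uppercase, titlecase and lowercase forms of the LJ digraph), and c4 b0 and c4 b1 decode to U+0130 (dotted capital I) and U+0131 (dotless small i). These are the characters whose upper/lower mappings do not form simple one-to-one pairs.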
--
GNU grep (DFA):
$ env LANG=en_US.utf8 ./test.sh "src/grep -i" 2>/dev/null | nl -ba
1 c7 87 | c7 89
2 c7 87 | c7 88 | c7 89
3 c7 87 | c7 89
4 49 | 69
5 49 | 69
6 69 | c4 b0
7 49 | c4 b1
GNU grep (regex):
$ env LANG=en_US.utf8 ./test.sh "src/grep -i" '\(\)\1' 2>/dev/null | nl -ba
1 c7 87 | c7 88 | c7 89
2 c7 87 | c7 88 | c7 89
3 c7 87 | c7 88 | c7 89
4 49 | 69 | c4 b1
5 49 | 69 | c4 b1
6 c4 b0
7 49 | 69 | c4 b1
pcregrep:
$ env LANG=en_US.utf8 ./test.sh "pcregrep -iu" 2>/dev/null | nl -ba
1 c7 87 | c7 88 | c7 89
2 c7 87 | c7 88 | c7 89
3 c7 87 | c7 88 | c7 89
4 49 | 69
5 49 | 69
6 c4 b0
7 c4 b1
Solaris grep (xpg4):
$ env LANG=ja_JP.UTF-8 ./test.sh "/usr/xpg4/bin/grep -i" 2>/dev/null | nl -ba
1 c7 87 | c7 89
2 c7 88
3 c7 87 | c7 89
4 49 | 69
5 49 | 69
6 error
7 error
HP-UX grep:
$ env LANG=en_US.utf8 ./test.sh "/bin/grep -i" 2>/dev/null | nl -ba
1 c7 87 | c7 88 | c7 89
2 c7 87 | c7 88 | c7 89
3 c7 87 | c7 88 | c7 89
4 49 | 69
5 49 | 69
6 c4 b0
7 c4 b1
--
Norihiro
Attachment: test.sh (binary data)
