> Hello,
>
> I found a strange bug in grep.
> Some Japanese runes do not match ‘[^0-9]’.
>
> for example ‘ま’ (307e) and ‘み’ (307f).
>
i can't replicate here with 9atom's fixes to grep.
with the same t3 file as you've got,
; wc -l /tmp/t3
21 /tmp/t3
; grep -v '^[0-9]' /tmp/t3 | wc -l
21
i have some other differences in grep, including -I (same
as -i, except that it folds runes), but i think the differences
in comp.c are what cause the bug. in particular, you really
need that 0xffff entry in the tabs.
/n/sources/plan9/sys/src/cmd/grep/comp.c:135,145 - comp.c:135,147
{
0x007f,
0x07ff,
+ 0xffff,
};
Rune tab2[] =
{
0x003f,
0x0fff,
+ 0xffff,
};
Re2
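the three values in the first table are also the largest code points that fit in 1, 2 and 3 bytes of utf-8. assuming the table is used to cut a rune range at those boundaries, the sketch below shows the idea. it is not grep's code: splitbylen, utfmax and the Rune typedef are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for Plan 9's Rune (assumption) */

/* largest code point encodable in 1, 2 and 3 bytes of utf-8 */
static const Rune utfmax[] = { 0x007f, 0x07ff, 0xffff };

/*
 * split the inclusive range lo..hi so that every piece encodes to
 * a single utf-8 length.  pieces are written to out as lo,hi pairs;
 * out needs room for 4 pairs.  returns the number of pieces.
 */
size_t
splitbylen(Rune lo, Rune hi, Rune *out)
{
	size_t i, k;

	assert(lo <= hi);
	k = 0;
	for(i = 0; i < sizeof utfmax / sizeof utfmax[0]; i++){
		if(lo <= utfmax[i] && hi > utfmax[i]){
			/* range crosses this boundary: cut it here */
			out[2*k] = lo;
			out[2*k+1] = utfmax[i];
			k++;
			lo = utfmax[i] + 1;
		}
	}
	out[2*k] = lo;
	out[2*k+1] = hi;
	return k + 1;
}
```

with only the first two entries, a range such as 0x0800-0x10ffff would stay in one piece even though it mixes 3-byte and 4-byte sequences.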
the additional pairs and the correction to the combining case
here were not accepted into sources, but they allow for the large
character classes generated by folding. many of the characters are
contiguous, so getting the contiguous case right is important.
/n/sources/plan9/sys/src/cmd/grep/comp.c:215,221 - comp.c:217,223
Re2
re2class(char *s)
{
- Rune pairs[200+2], *p, *q, ov;
+ Rune pairs[400+2], *p, *q, ov;
int nc;
Re2 x;
/n/sources/plan9/sys/src/cmd/grep/comp.c:234,240 - comp.c:236,242
break;
p[1] = *p;
p += 2;
- if(p >= pairs + nelem(pairs) - 2)
+ if(p == pairs + nelem(pairs) - 2)
error("class too big");
s += chartorune(p, s);
if(*p != '-')
/n/sources/plan9/sys/src/cmd/grep/comp.c:254,260 - comp.c:256,262
for(p=pairs+2; *p; p+=2) {
if(p[0] > p[1])
continue;
- if(p[0] > q[1] || p[1] < q[0]) {
+ if(p[0] > q[1]+1 || p[1] < q[0]) {
q[2] = p[0];
q[3] = p[1];
q += 2;
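to make the contiguous case concrete, here is a minimal standalone sketch of coalescing sorted lo,hi pairs so that ranges which merely touch (next lo == previous hi + 1) are merged too. it is not the code from comp.c: mergepairs and the Rune typedef are hypothetical, and the input is assumed to be sorted by lo with a valid first pair.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for Plan 9's Rune (assumption) */

/*
 * coalesce n inclusive ranges stored as lo,hi pairs, in place.
 * assumes the pairs are sorted by lo and the first pair is valid.
 * overlapping and adjacent ranges are merged.
 * returns the number of ranges left.
 */
size_t
mergepairs(Rune *pairs, size_t n)
{
	size_t i, out;
	Rune lo, hi, prev;

	assert(pairs != NULL);
	if(n == 0)
		return 0;
	out = 0;	/* index of the range being grown */
	for(i = 1; i < n; i++){
		lo = pairs[2*i];
		hi = pairs[2*i+1];
		prev = pairs[2*out+1];
		if(lo > hi)
			continue;	/* empty range, skip it */
		if(lo > prev && lo - prev > 1){
			/* a real gap: start a new range */
			out++;
			pairs[2*out] = lo;
			pairs[2*out+1] = hi;
		}else if(hi > prev)
			pairs[2*out+1] = hi;	/* overlap or adjacency: extend */
	}
	return out + 1;
}
```

with the pairs 0x3041-0x3042, 0x3043-0x3043, 0x3050-0x3060 this leaves two ranges, 0x3041-0x3043 and 0x3050-0x3060. a test without the adjacency check would keep all three.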
i believe this case is also critical. split the bmp off.
/n/sources/plan9/sys/src/cmd/grep/comp.c:275,281 - comp.c:277,283
x = re2or(x, rclass(ov, p[0]-1));
ov = p[1]+1;
}
- x = re2or(x, rclass(ov, Runemask));
+ x = re2or(x, rclass(ov, 0xffff));
} else {
x = rclass(p[0], p[1]);
for(p+=2; *p; p+=2)
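as an illustration of the negated case, the sketch below builds the complement of a sorted, merged pair list up to a caller-supplied ceiling, the role 0xffff plays in the hunk above. it is a simplified standalone version, not grep's code: complement and the Rune typedef are hypothetical, the lower bound is assumed to be 0, and every input range is assumed to lie at or below max.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for Plan 9's Rune (assumption) */

/*
 * write the gaps between n sorted, non-overlapping inclusive
 * ranges (lo,hi pairs in in[]) to out[], covering 0..max.
 * out needs room for n+1 pairs.  returns the number of gaps.
 */
size_t
complement(const Rune *in, size_t n, Rune max, Rune *out)
{
	size_t i, k;
	Rune lo, hi, ov;

	assert(out != NULL);
	k = 0;
	ov = 0;		/* lowest rune not yet accounted for */
	for(i = 0; i < n; i++){
		lo = in[2*i];
		hi = in[2*i+1];
		if(lo > ov){
			/* gap below this range */
			out[2*k] = ov;
			out[2*k+1] = lo - 1;
			k++;
		}
		if(hi >= max)
			return k;	/* nothing left above */
		ov = hi + 1;
	}
	/* final gap runs up to the ceiling */
	out[2*k] = ov;
	out[2*k+1] = max;
	return k + 1;
}
```

for the class 0-9 (0x30-0x39) with max 0xffff this gives 0x00-0x2f and 0x3a-0xffff, and the second gap contains 0x307e and 0x307f.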
- erik