thanks erik.
that fixed the problems with my sample data!

Kenji Arisawa

On 2014/03/30 8:54, erik quanstrom <[email protected]> wrote:

>> Hello,
>> 
>> I found a strange bug in grep.
>> some Japanese runes do not match ‘[^0-9]’.
>> 
>> for example ‘ま’ (307e) and ‘み’ (307f).
>> 
> 
> i can't replicate here with 9atom's fixes to grep.
> with the same t3 file as you've got,
> 
>       ; wc -l /tmp/t3
>            21 /tmp/t3
>       ; grep -v '^[0-9]' /tmp/t3 | wc -l
>            21
> 
> i have some other differences in grep, including -I (same
> as -i, except fold runes), but i think the differences in
> comp.c are what cause the bug.  in particular, you really
> need that 0xffff entry in the tabs.
> 
> /n/sources/plan9/sys/src/cmd/grep/comp.c:135,145 - comp.c:135,147
>  {
>       0x007f,
>       0x07ff,
> +     0xffff,
>  };
>  Rune tab2[] =
>  {
>       0x003f,
>       0x0fff,
> +     0xffff,
>  };
> 
>  Re2
> 
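A note on why the 0xffff entry might matter. The values 0x007f, 0x07ff and 0xffff are the largest code points that UTF-8 encodes in 1, 2 and 3 bytes. Assuming tab1 is used to cut a rune range into pieces whose members all encode to the same number of bytes (my reading of the values, not something stated in the message), a range that crosses 0xffff would not be cut there without the third entry. The following is a minimal sketch in standard C, not the grep source; `lim`, `utflen` and `splitrange` are hypothetical names.

```c
#include <assert.h>

/* Largest code point for 1-, 2- and 3-byte UTF-8 sequences. */
static const unsigned long lim[] = { 0x007f, 0x07ff, 0xffff };

/* Number of bytes UTF-8 uses for code point r (r <= 0x10ffff). */
static int
utflen(unsigned long r)
{
	int i;

	for(i = 0; i < 3; i++)
		if(r <= lim[i])
			return i+1;
	return 4;
}

/*
 * Cut [lo, hi] at every boundary in lim so that each piece
 * holds code points of a single encoded length.
 * Pieces are stored in out (room for 4); returns the count.
 */
static int
splitrange(unsigned long lo, unsigned long hi, unsigned long out[][2])
{
	int i, n;

	assert(lo <= hi);
	n = 0;
	for(i = 0; i < 3; i++){
		if(lo <= lim[i] && hi > lim[i]){
			out[n][0] = lo;
			out[n][1] = lim[i];
			n++;
			lo = lim[i]+1;
		}
	}
	out[n][0] = lo;
	out[n][1] = hi;
	return n+1;
}
```

With this sketch, the range 0x3a through 0x10ffff comes back as four pieces, and the third piece is 0x800 through 0xffff, which is where ま (307e) and み (307f) live.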
> the additional pairs and the correction to the combining case
> here were not accepted to sources, but they allow for large character
> classes generated by folding.  many of the characters are contiguous
> so getting the contiguous case right is important.
> 
> /n/sources/plan9/sys/src/cmd/grep/comp.c:215,221 - comp.c:217,223
>  Re2
>  re2class(char *s)
>  {
> -     Rune pairs[200+2], *p, *q, ov;
> +     Rune pairs[400+2], *p, *q, ov;
>       int nc;
>       Re2 x;
> 
> /n/sources/plan9/sys/src/cmd/grep/comp.c:234,240 - comp.c:236,242
>                       break;
>               p[1] = *p;
>               p += 2;
> -             if(p >= pairs + nelem(pairs) - 2)
> +             if(p == pairs + nelem(pairs) - 2)
>                       error("class too big");
>               s += chartorune(p, s);
>               if(*p != '-')
> /n/sources/plan9/sys/src/cmd/grep/comp.c:254,260 - comp.c:256,262
>       for(p=pairs+2; *p; p+=2) {
>               if(p[0] > p[1])
>                       continue;
> -             if(p[0] > q[1] || p[1] < q[0]) {
> +             if(p[0] > q[1]+1 || p[1] < q[0]) {
>                       q[2] = p[0];
>                       q[3] = p[1];
>                       q += 2;
> 
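The effect of the q[1]+1 change can be seen in isolation. With the old test, the ranges a-c and d-f stay separate because they do not overlap; with the +1 they are treated as contiguous and collapse into a-f. A minimal sketch in standard C, assuming the pairs are already sorted by their low end; `Range` and `mergeranges` are hypothetical names and this is not the code from comp.c.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Range Range;
struct Range
{
	unsigned long lo;
	unsigned long hi;
};

/*
 * Merge overlapping or adjacent ranges in place.
 * r must be sorted by lo; hi is assumed to be well below
 * the largest unsigned long, so hi+1 cannot wrap.
 * Returns the number of ranges left.
 */
static size_t
mergeranges(Range *r, size_t n)
{
	size_t i, q;

	if(n == 0)
		return 0;
	q = 0;
	for(i = 1; i < n; i++){
		assert(r[i].lo >= r[q].lo);	/* input must be sorted */
		if(r[i].lo > r[q].hi + 1){
			/* a real gap: start a new output range */
			r[++q] = r[i];
			continue;
		}
		/* overlapping or touching: extend the current range */
		if(r[i].hi > r[q].hi)
			r[q].hi = r[i].hi;
	}
	return q+1;
}
```

Given a-c, d-f, h-k and j-m, this leaves two ranges, a-f and h-m. Fewer ranges means fewer alternatives for the matcher to build, which is presumably why the contiguous case matters for the large classes that folding produces.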
> i believe this case is also critical.  split the bmp off.
> 
> /n/sources/plan9/sys/src/cmd/grep/comp.c:275,281 - comp.c:277,283
>                       x = re2or(x, rclass(ov, p[0]-1));
>                       ov = p[1]+1;
>               }
> -             x = re2or(x, rclass(ov, Runemask));
> +             x = re2or(x, rclass(ov, 0xffff));
>       } else {
>               x = rclass(p[0], p[1]);
>               for(p+=2; *p; p+=2)
> 
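For a negated class such as [^0-9], the hunk above builds the complement by walking the gaps between the sorted pairs and closing the last gap at an upper limit. A minimal sketch of that idea in standard C, assuming the input ranges are sorted and already merged; `Range`, `complement` and the `max` parameter are hypothetical, and the choice of limit (0xffff in the patch, Runemask in sources) is passed in rather than fixed.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Range Range;
struct Range
{
	unsigned long lo;
	unsigned long hi;
};

/*
 * Write the complement of in[0..n-1] within [0, max] to out.
 * in must be sorted and non-overlapping; out needs room for n+1.
 * Returns the number of ranges written.
 */
static size_t
complement(const Range *in, size_t n, unsigned long max, Range *out)
{
	size_t i, m;
	unsigned long ov;	/* first value not yet accounted for */

	m = 0;
	ov = 0;
	for(i = 0; i < n && ov <= max; i++){
		assert(in[i].lo <= in[i].hi);
		if(in[i].lo > ov){
			/* gap before this range, clipped to max */
			out[m].lo = ov;
			out[m].hi = in[i].lo - 1 < max ? in[i].lo - 1 : max;
			m++;
		}
		ov = in[i].hi + 1;
	}
	if(ov <= max){
		/* everything after the last range, up to the limit */
		out[m].lo = ov;
		out[m].hi = max;
		m++;
	}
	return m;
}
```

For the single range 0-9 and a limit of 0xffff this yields two ranges, 0x00 through 0x2f and 0x3a through 0xffff, and 307e falls inside the second one.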
> - erik
> 

