Hi Grep Guys. A while back David Millis reported a rather strange problem with gawk 4.0.0 on Windows:
> Date: Sat, 10 Sep 2011 23:13:25 -0700 (PDT)
> From: David Millis <[email protected]>
> To: [email protected]
> Subject: [bug-gawk] 4.0.0 Regex Patterns Choke on Exotic Chars
>
> # A bug in GNU AWK 4.0.0's regex handling?
> # 3.1.6 (GnuWin32)/3.1.7 (Jgawk?, had |& intact) worked.
> # It cripples manipulation of mildly exotic chars.
> # In Windows anyway (Binary: http://www.klabaster.com/freeware.htm#dl).
> # I couldn't reproduce it in Debian with 4.0.0.
>
> BEGIN {
>   # For this, escaping is no different from pasting the genuine char.
>   badChar = "\x95";
>   # This is a bullet (\x95, vim: ctrl-v+149) in the Win-1252 codepage.
>   # It happens to be in the \x80-\x9f range
>   # where Win-1252 diverges from strict Latin-1.
>   # Most apps don't care, but this might be the issue...
>   # Hmm, middledot (\xb7, vim: ctrl-v+183) shows the same behavior.
>
>   print badChar; # Prints fine
>   print gensub(/\x95/, "@", "", badChar); # Error
>
>   # The char is acceptable as the gsub/gensub replacement arg.
>   # But not as the pattern: be it /literal/ or "string".
>   # Upon reaching the line, gsub/gensub throw "unbalanced )".
>   # Or an "internal error" if used in a character class /[\x95]/.
>
>   # Mundane escapes like \x22 for double-quote are fine.
> }
>
>
> I sent this to Eli Zaretskii, who replied:
> > This also happens in 3.1.8 (on Windows).
> >
> > Please send this bug report to [email protected],
> > I have no idea what is wrong with this character,
> > and why only on Windows.
>
>
> David

Eli finally traced this down. His report and fix follow.

Can y'all comment on this please? In particular, is there a different
or better way to fix this?

Unless I hear differently from you, I plan to apply the patch in the
next day or two.

Thanks,

Arnold

> Date: Fri, 30 Sep 2011 16:33:35 +0300
> From: Eli Zaretskii <[email protected]>
> Subject: Re: [bug-gawk] 4.0.0 Regex Patterns Choke on Exotic Chars
> To: [email protected]
> Cc: [email protected], [email protected]
>
> > Date: Mon, 12 Sep 2011 07:19:10 GMT
> > From: [email protected]
> > Cc: [email protected], [email protected]
> >
> > Otherwise, it looks like a problem with compiling the regular expression.
> > Start with make_regexp and keep digging down. You may want to try
> > compiling without optimization; I've seen the regex code break optimizers
> > before.
>
> No, optimizations have nothing to do with this (I see the problem in a
> non-optimized build as well).
>
> This bug is caused by the most mundane and dull issue with mixing
> signed and unsigned. To tell the truth, I never expected to see such
> issues in GNU sources that are used for such a long time.
>
> Here's the thing. The fatal error comes from here:
>
>   regexp();
>
>   if (tok != END)
>     dfaerror(_("unbalanced )"));
>
> I.e., dfaparse expects all the string to be exhausted when `regexp'
> returns. In `regexp' we see:
>
>   static void
>   regexp (void)
>   {
>     branch();
>     while (tok == OR)
>       {
>         tok = lex();
>         branch();
>         addtok(OR);
>       }
>   }
>
> where `branch' does this:
>
>   static void
>   branch (void)
>   {
>     closure();
>     while (tok != RPAREN && tok != OR && tok >= 0)
>       {
>         closure();
>         addtok(CAT);
>       }
>   }
>
> Note that `branch' terminates the loop when `tok' is negative (and
> there are other subroutines of dfa.c that do the same). Now, `tok'
> is an enumerated data type that has a single negative value:
>
>   typedef enum
>   {
>     END = -1,
>
>     /* Ordinary character values are terminal symbols that match themselves.
>      */
>
>     EMPTY = NOTCHAR, /* EMPTY is a terminal symbol that matches
>     ...
>
> NOTCHAR is 256. So obviously, `branch' assumes that `tok' will only
> be negative when its value is END.
> However, `lex' calls FETCH_WC and FETCH macros that on Windows return
> negative values for any character greater than 127. So the loop ends
> prematurely, and the rest is history.
>
> Why do we get negative values from FETCH_WC and FETCH? Because they
> assume that casting to an unsigned type converts a negative value to a
> positive one. But what happens in fact is sign extension, so instead
> of 0x95 we get 0xffffff95. Assigning this to a signed int (because
> `tok's return value has the same enumerated type mentioned above,
> which must be signed to accommodate for -1) converts back to a
> negative value.
>
> I can fix the problem with the following simple patch. I don't
> consider myself an expert on futzing with signed and unsigned values,
> so I'll leave it to the experts to figure out The Right Way if this
> one isn't. I did test the patch on GNU/Linux and verified that
> David's script works there after applying the patch below.
>
> 2011-09-30  Eli Zaretskii  <[email protected]>
>
>     * dfa.c (FETCH_WC, FETCH): Produce an unsigned value, rather than
>     a sign-extended one. Fixes a bug on MS-Windows with compiling
>     patterns that include characters with the 8-th bit set.
>     Reported by David Millis <[email protected]>.
>
> --- dfa.c.orig 2011-06-23 12:27:01.000000000 +0300
> +++ dfa.c 2011-09-30 16:06:25.609375000 +0300
> @@ -691,19 +691,22 @@ static unsigned char const *buf_end; /*
>      else                                                \
>        {                                                 \
>          wchar_t _wc;                                    \
> +        unsigned char uc;                               \
>          cur_mb_len = mbrtowc(&_wc, lexptr, lexleft, &mbs); \
>          if (cur_mb_len <= 0)                            \
>            {                                             \
>              cur_mb_len = 1;                             \
>              --lexleft;                                  \
> -            (wc) = (c) = (unsigned char) *lexptr++;     \
> +            uc = (unsigned char) *lexptr++;             \
> +            (wc) = (c) = uc;                            \
>            }                                             \
>          else                                            \
>            {                                             \
>              lexptr += cur_mb_len;                       \
>              lexleft -= cur_mb_len;                      \
>              (wc) = _wc;                                 \
> -            (c) = wctob(wc);                            \
> +            uc = (unsigned) wctob(wc);                  \
> +            (c) = uc;                                   \
>            }                                             \
>        }                                                 \
>    } while(0)
> @@ -718,6 +721,7 @@ static unsigned char const *buf_end; /*
>  /* Note that characters become unsigned here. */
>  # define FETCH(c, eoferr)            \
>    do {                               \
> +    unsigned char uc;                \
>      if (! lexleft)                   \
>        {                              \
>          if ((eoferr) != 0)           \
> @@ -725,7 +729,8 @@ static unsigned char const *buf_end; /*
>          else                         \
>            return lasttok = END;      \
>        }                              \
> -    (c) = (unsigned char) *lexptr++; \
> +    uc = (unsigned char) *lexptr++;  \
> +    (c) = uc;                        \
>      --lexleft;                       \
>    } while(0)
>
>
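
For anyone who wants to poke at the signed/unsigned mechanism Eli describes without building gawk, here is a minimal sketch in C. It is not code from dfa.c. The helper names (widen_sign_extended, widen_via_uchar, loop_would_continue) are made up for illustration, signed char is used so the result does not depend on whether plain char is signed on a given compiler, and the 0xffffff95 value assumes a 32-bit unsigned int.

```c
/* Illustrative sketch only; none of these helpers exist in dfa.c. */

/* Converting a signed byte straight to a wider unsigned type
   sign-extends it first: the byte 0x95 (-107 as a signed char)
   becomes 0xffffff95 when unsigned int is 32 bits wide. */
static unsigned int
widen_sign_extended (signed char ch)
{
  return (unsigned int) ch;
}

/* Storing the byte in an unsigned char first, as the patch does with
   its `uc' temporary, keeps the widened result in the range 0..255. */
static int
widen_via_uchar (signed char ch)
{
  unsigned char uc = (unsigned char) ch;
  return uc;
}

/* Mirrors only the `tok >= 0' part of the loop test in `branch'. */
static int
loop_would_continue (int tok)
{
  return tok >= 0;
}
```

Called with the byte -107 (bit pattern 0x95), widen_via_uchar returns 149, which passes the `tok >= 0' test, while the same byte widened as a signed value stays -107 and fails it. That is the premature loop exit described in the report.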
