dfa - gawk matching problem on windows and suggested fix

Aharon Robbins Sun, 02 Oct 2011 12:14:27 -0700

Hi Grep Guys.

A while back David Millis reported a rather strange problem with gawk 4.0.0
on Windows:


> Date: Sat, 10 Sep 2011 23:13:25 -0700 (PDT)
> From: David Millis <[email protected]>
> To: [email protected]
> Subject: [bug-gawk] 4.0.0 Regex Patterns Choke on Exotic Chars
>
> # A bug in GNU AWK 4.0.0's regex handling?
> # 3.1.6 (GnuWin32)/3.1.7 (Jgawk?, had |& intact) worked.
> # It cripples manipulation of mildly exotic chars.
> # In Windows anyway (Binary: http://www.klabaster.com/freeware.htm#dl).
> # I couldn't reproduce it in Debian with 4.0.0.
>
> BEGIN {
>   # For this, escaping is no different from pasting the genuine char.
>   badChar = "\x95";
>   # This is a bullet (\x95, vim: ctrl-v+149) in the Win-1252 codepage.
>   # It happens to be in the \x80-\x9f range
>   #   where Win-1252 diverges from strict Latin-1.
>   # Most apps don't care, but this might be the issue...
>   # Hmm, middledot (\xb7, vim: ctrl-v+183) shows the same behavior.
>
>   print badChar; # Prints fine
>   print gensub(/\x95/, "@", "", badChar); # Error
>
>   # The char is acceptable as the gsub/gensub replacement arg.
>   # But not as the pattern: be it /literal/ or "string".
>   # Upon reaching the line, gsub/gensub throw "unbalanced )".
>   # Or an "internal error" if used in a character class /[\x95]/.
>
>   # Mundane escapes like \x22 for double-quote are fine.
> }
>
>
> I sent this to Eli Zaretskii, who replied:
> > This also happens in 3.1.8 (on Windows).
> >
> > Please send this bug report to [email protected],
> > I have no idea what is wrong with this character,
> > and why only on Windows.
>
>
> David

Eli finally traced this down. His report and fix follow.  Can y'all
comment on this please? In particular, is there a different or better way to
fix this?  Unless I hear differently from you, I plan to apply the
patch in the next day or two.

Thanks,

Arnold

> Date: Fri, 30 Sep 2011 16:33:35 +0300
> From: Eli Zaretskii <[email protected]>
> Subject: Re: [bug-gawk] 4.0.0 Regex Patterns Choke on Exotic Chars
> To: [email protected]
> Cc: [email protected], [email protected]
>
> > Date: Mon, 12 Sep 2011 07:19:10 GMT
> > From: [email protected]
> > Cc: [email protected], [email protected]
> > 
> > Otherwise, it looks like a problem with compiling the regular expression.
> > Start with make_regexp and keep digging down.  You may want to try
> > compiling without optimization; I've seen the regex code break optimizers
> > before.
>
> No, optimizations have nothing to do with this (I see the problem in a
> non-optimized build as well).
>
> This bug is caused by the most mundane and dull issue with mixing
> signed and unsigned.  To tell the truth, I never expected to see such
> issues in GNU sources that have been in use for such a long time.
>
> Here's the thing.  The fatal error comes from here:
>
>   regexp();
>
>   if (tok != END)
>     dfaerror(_("unbalanced )"));
>
> I.e., dfaparse expects the whole string to be exhausted when `regexp'
> returns.  In `regexp' we see:
>
>   static void
>   regexp (void)
>   {
>     branch();
>     while (tok == OR)
>       {
>       tok = lex();
>       branch();
>       addtok(OR);
>       }
>   }
>
> where `branch' does this:
>
>   static void
>   branch (void)
>   {
>     closure();
>     while (tok != RPAREN && tok != OR && tok >= 0)
>       {
>       closure();
>       addtok(CAT);
>       }
>   }
>
> Note that `branch' terminates the loop when `tok' is negative (and
> there are other subroutines of dfa.c that do the same).  Now, `tok'
> is an enumerated data type that has a single negative value:
>
>   typedef enum
>   {
>     END = -1,
>
>     /* Ordinary character values are terminal symbols that match themselves. */
>
>     EMPTY = NOTCHAR,          /* EMPTY is a terminal symbol that matches
>     ...
>
> NOTCHAR is 256.  So obviously, `branch' assumes that `tok' will only
> be negative when its value is END.  However, `lex' calls FETCH_WC and
> FETCH macros that on Windows return negative values for any character
> greater than 127.  So the loop ends prematurely, and the rest is
> history.
>
> Why do we get negative values from FETCH_WC and FETCH?  Because they
> assume that casting to an unsigned type converts a negative value to a
> positive one.  But what happens in fact is sign extension, so instead
> of 0x95 we get 0xffffff95.  Assigning this to a signed int (because
> `lex's return value has the same enumerated type mentioned above,
> which must be signed to accommodate -1) converts back to a
> negative value.
>
> I can fix the problem with the following simple patch.  I don't
> consider myself an expert on futzing with signed and unsigned values,
> so I'll leave it to the experts to figure out The Right Way if this
> one isn't.  I did test the patch on GNU/Linux and verified that
> David's script works there after applying the patch below.
>
> 2011-09-30  Eli Zaretskii  <[email protected]>
>
>       * dfa.c (FETCH_WC, FETCH): Produce an unsigned value, rather than
>       a sign-extended one.  Fixes a bug on MS-Windows with compiling
>       patterns that include characters with the 8-th bit set.
>       Reported by David Millis <[email protected]>.
>
> --- dfa.c.orig        2011-06-23 12:27:01.000000000 +0300
> +++ dfa.c     2011-09-30 16:06:25.609375000 +0300
> @@ -691,19 +691,22 @@ static unsigned char const *buf_end;    /* 
>      else                                     \
>        {                                              \
>          wchar_t _wc;                         \
> +        unsigned char uc;                    \
>          cur_mb_len = mbrtowc(&_wc, lexptr, lexleft, &mbs); \
>          if (cur_mb_len <= 0)                 \
>            {                                  \
>              cur_mb_len = 1;                  \
>              --lexleft;                               \
> -            (wc) = (c) = (unsigned char) *lexptr++; \
> +            uc = (unsigned char) *lexptr++;  \
> +         (wc) = (c) = uc;                    \
>            }                                  \
>          else                                 \
>            {                                  \
>              lexptr += cur_mb_len;            \
>              lexleft -= cur_mb_len;           \
>              (wc) = _wc;                              \
> -            (c) = wctob(wc);                 \
> +            uc = (unsigned) wctob(wc);               \
> +            (c) = uc;                                \
>            }                                  \
>        }                                              \
>    } while(0)
> @@ -718,6 +721,7 @@ static unsigned char const *buf_end;      /* 
>  /* Note that characters become unsigned here. */
>  # define FETCH(c, eoferr)          \
>    do {                                     \
> +    unsigned char uc;                      \
>      if (! lexleft)                 \
>        {                                    \
>          if ((eoferr) != 0)         \
> @@ -725,7 +729,8 @@ static unsigned char const *buf_end;      /* 
>          else                       \
>            return lasttok = END;            \
>        }                                    \
> -    (c) = (unsigned char) *lexptr++;  \
> +    uc = (unsigned char) *lexptr++;   \
> +    (c) = uc;                              \
>      --lexleft;                             \
>    } while(0)
>  
>
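
To make the premature loop exit in Eli's analysis concrete, here is a
minimal sketch of what happens when a character token goes negative.
It is not code from dfa.c: TOK_END and tokens_consumed are hypothetical
names, and the token stream is modelled as a plain array of int.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the END = -1 member of the token enum (hypothetical name). */
enum { TOK_END = -1 };

/* Count how many tokens a `tok >= 0' loop walks over before it stops.
   This mimics only the `tok >= 0' part of the loop condition in branch().  */
static size_t
tokens_consumed (const int *toks, size_t n)
{
  size_t i = 0;
  while (i < n && toks[i] >= 0)
    i++;
  return i;
}
```

Given the stream { 'a', 0x95, 'b', TOK_END } the loop walks over all
three characters.  Given { 'a', -107, 'b', TOK_END }, where the bullet
has arrived as a negative value, it stops after the first one, leaving
unparsed input behind.  That is the state that makes dfaparse report
"unbalanced )".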

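The 0x95 versus 0xffffff95 arithmetic in Eli's analysis can be checked
in isolation.  The sketch below is only an illustration, not the macro
code itself: widen_direct and widen_via_uchar are hypothetical helpers,
and signed char is used explicitly so that the result does not depend
on whether plain char is signed on a given platform.

```c
#include <assert.h>
#include <limits.h>

/* Convert a byte held in a signed char straight to unsigned int.
   The conversion is modulo UINT_MAX + 1, so a negative byte such as
   -107 (bit pattern 0x95) becomes 0xffffff95 when unsigned int is
   32 bits wide.  */
static unsigned int
widen_direct (signed char ch)
{
  return (unsigned int) ch;
}

/* Narrow to unsigned char first, then widen to int, in the spirit of
   the `uc' temporary in the patch.  The result is always in the range
   0..UCHAR_MAX, so it can never be mistaken for END (-1).  */
static int
widen_via_uchar (signed char ch)
{
  unsigned char uc = (unsigned char) ch;
  return uc;
}
```

With these definitions widen_direct (-107) yields UINT_MAX - 106, while
widen_via_uchar (-107) yields 0x95 (149), which is the value the lexer
needs.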