Il 25/06/2013 07:54, Aharon Robbins ha scritto: > Hi All. > > A few weeks ago I reported a bug in dfa that was causing an assertion > failure in gawk. It occurred in UTF locales. Attached are two test > programs, one that fails, and one that doesn't, and input. The failure > is seen in gawk, I'm not sure how to reproduce in grep. > > Mike Haertel was kind enough to track down the problem and fix it for > me. His email is below. I have already applied his patch to the gawk > code. I applied it to the dfa in grep 2.14 and grep passed its "make check" > but that is just FYI. > > If you (plural) think it makes sense, please apply Mike's fix to > the grep dfa also, so that we can stay in sync. > > Thanks! > > Arnold > >> To: [email protected] >> Subject: tentative fix for DFA bug, with explanation >> Date: Sun, 23 Jun 2013 12:24:03 -0700 >> From: Mike Haertel <[email protected]> >> >> Since this is 7-bit email, let FOO stand for the chinese character >> in the text regexp. It turns out the following much simpler regexp: >> ([^.]*[FOO]){1,2} >> is sufficient to cause the crash. >> >> Some background info: in the first step of its parsing, DFA transforms >> regexp from human readable syntax into reverse-polish form. For >> example, if the regexp is a|b (where a and b are arbitrary >> subexpressions), the RPN representation looks like: >> >> <RPN representation for a> >> <RPN representation for b> >> OR >> >> and if the regexp is ab, the RPN representation is >> >> <RPN representation for a> >> <RPN representation for b> >> CAT >> >> For regexps of the form a{m,n} repeat counts, it simply builds >> repeated copies of the representation of a, with appropriate inserted >> CAT and QMARK operators. For the above example with a regexp of >> the form a{1,2} it would build: >> >> <RPN representation for a> >> <RPN representation for a> >> QMARK >> CAT >> >> When building repeated copies of RPN representations, additional >> copies of the RPN representations are made by calling a function >> copytoks() with arguments consisting of the start position and >> length of the original copy. >> >> The problem is that the current code for copytoks() is simply >> incorrect. It operates by calling addtok() for each individual >> token in the source range being copied. But, in the particular >> case that the token being added is MBCSET, addtok(): >> >> (1) incorrectly assumes that the character set being added to be added >> is the one most (addtok has no argument to indicate which cset is >> being added, so it just uses the latest one) >> >> (2) attempts to do some token sequence expansion into more primitive >> operators so things like [FOO] are matched efficiently. >> >> Both of these assumptions are incorrect in the case that addtok() >> is being called from copytoks(): (1) is simply not true, and >> (2) is redundant--the expansion has already been done token sequence >> being copied, so there is no need to do the expansion again. >> >> Based on my reading of the code, it looks like the correct function to >> add exactly one token, without further expansion, is addtok_mb(). >> >> So here is my proposed fix, which is that copytoks() should never call >> addtok(), but instead directly call addtok_mb() (which is what addtok() >> eventually calls). >> >> diff --git a/dfa.c b/dfa.c >> index 54e0ae9..2195e28 100644 >> --- a/dfa.c >> +++ b/dfa.c >> @@ -1847,13 +1847,12 @@ copytoks (size_t tindex, size_t ntokens) >> { >> size_t i; >> >> - for (i = 0; i < ntokens; ++i) >> - { >> - addtok (dfa->tokens[tindex + i]); >> - /* Update index into multibyte csets. */ >> - if (MB_CUR_MAX > 1 && dfa->tokens[tindex + i] == MBCSET) >> - dfa->multibyte_prop[dfa->tindex - 1] = dfa->multibyte_prop[tindex + >> i]; >> - } >> + if (MB_CUR_MAX > 1) >> + for (i = 0; i < ntokens; ++i) >> + addtok_mb(dfa->tokens[tindex + i], dfa->multibyte_prop[tindex + i]); >> + else >> + for (i = 0; i < ntokens; ++i) >> + addtok_mb(dfa->tokens[tindex + i], 3); >> } >> >> static void >> >> With this fix, both tp1.awk and tp2.awk pass. I haven't tested it on >> anything >> else, since I have no idea how to run your validation suite (if any). >> >> So please try this out on your test suite and let me know how it works.
Looks good, thanks! Paolobug-g
