In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/a0e31c931629c9c729765aaa725b0d7297b3933c?hp=8cdde9f826664af3e1c4c5f5f1bd9642d7aee812>
- Log ----------------------------------------------------------------- commit a0e31c931629c9c729765aaa725b0d7297b3933c Merge: 8cdde9f a35e750 Author: Nicholas Clark <[email protected]> Date: Tue Mar 19 11:53:38 2013 +0100 Merge in the changes that remove setjmp() from regcomp.c The code now uses regular returns instead of setjmp() and longjmp() for signalling the need for pattern compilation to restart. By avoiding this, and the corresponding need to mark many variables as volatile, we make the code less fragile, and Address Sanitizer is now happy. commit a35e7505f3e9cd4a87a7d911c4e7ae19e97cb9f6 Author: Nicholas Clark <[email protected]> Date: Mon Feb 4 17:54:33 2013 +0100 Document the uses of NULL returns in the regex parsing code. M pod/perlreguts.pod commit d2d8c7119aed4a86948ff470a23161eb2f07f551 Author: Nicholas Clark <[email protected]> Date: Mon Jan 21 20:58:49 2013 +0100 Revert "PATCH: regex longjmp flaws" This reverts commit 595598ee1f247e72e06e4cfbe0f98406015df5cc. The netbsd - 5.0.2 compiler pointed out that the recent changes to add longjmps to speed up some regex compilations can result in clobbering a few values. These depend on the compiled code, and so didn't show up in other compiler's warnings. This patch reinitializes them after a longjmp. [With a lot of hand editing in regcomp.c, to propagate the changes through subsequent commits.] M regcomp.c M regcomp.h commit 1067df30ae91aad06ce62722a2ebd437e25358bf Author: Nicholas Clark <[email protected]> Date: Mon Jan 21 20:32:01 2013 +0100 In Perl_re_op_compile(), tidy up after removing setjmp(). Remove volatile qualifiers. Remove the variable jump_ret. Move the initialisation of restudied back to the declaration. This reverts several of the changes made by commits 5d51ce98fae3de07 and bbd61b5ffb7621c2. However, I can't see a cleaner way to avoid code duplication when restarting the parse than to approach I've taken here - the label redo_first_pass is now inside an if (0) block, which is clear but ugly. M embed.fnc M proto.h M regcomp.c commit a9d37de4f0d6753136e788d7997eaef2b3e47ba0 Author: Nicholas Clark <[email protected]> Date: Mon Jan 21 17:15:30 2013 +0100 Replace the longjmp()s in Perl_re_op_compile() with goto. The regex parse needs to be restarted if it turns out that it should be done as UTF-8, not bytes. Using setjmp()/longjmp() complicates compilation considerably, causing warnings about missing use of volatile, and hitting code generation errors from clang's ASAN. Using goto is much clearer. M regcomp.c commit 8aed9c310f6afc92a1b486e88fa61c5ccd648a9a Author: Nicholas Clark <[email protected]> Date: Sat Jan 19 11:06:10 2013 +0100 Move the longjmp() that implements REQUIRE_UTF8 up to Perl_re_op_compile(). With longjmp() and setjmp() now in the same function (and all tests passing), it becomes easy to replace the pair with a goto. Still evil, but "the lesser of two evils". M regcomp.c commit 8d9c2815bfe0506d5c64749349c443f812453ba1 Author: Nicholas Clark <[email protected]> Date: Fri Jan 18 17:21:03 2013 +0100 Add a flag RESTART_UTF8 to the reg*() routines in regcomp.c Add a flag RESTART_UTF8 along with infrastructure to the reg*() routines to permit the parse to be restarted without using longjmp(). However, it's not used yet. M regcomp.c commit 5a4ca00e1ae70ead90d5ddb57d6bf189a749ae9c Author: Nicholas Clark <[email protected]> Date: Fri Jan 18 11:32:44 2013 +0100 In S_regclass(), create listsv as a mortal, claiming a reference if needed. The SV listsv is sometimes stored in an array generated near the end of S_regclass(). In other cases it is not used, and it needs to be freed if any of the warnings that S_regclass() can trigger turn out to be fatal. The simplest solution to this problem is to declare it from the start as a mortal, and claim a (new) reference to it if it is *not* to be freed. This permits the removal of all other code related to ensuring that it is freed at the right time, but not freed prematurely if a call to a warning returns. M embed.fnc M embed.h M proto.h M regcomp.c commit b8989050d30609050cd38eb5df7c4142da5f63bf Author: Nicholas Clark <[email protected]> Date: Thu Jan 17 11:47:13 2013 +0100 Document when and why S_reg{,branch,piece,atom,class}() return NULL. As documented in pod/perlreguts.pod, the call graph for regex parsing involves several levels of functions in regcomp.c, sometimes recursing more than once. The top level compiling function, S_reg(), calls S_regbranch() to parse each single branch of an alternation. In turn, that calls S_regpiece() to parse a simple pattern followed by quantifier, which calls S_regatom() to parse that simple pattern. S_regatom() can call S_regclass() to handle classes, but can also recurse into S_reg() to handle subpatterns and some other constructions. Some other routines call call S_reg(), sometimes using an alternative pattern that they generate dynamically to represent their input. These routines all return a pointer to a regnode structure, and take a pointer to an integer that holds flags, which is also used to return information. Historically, it has not been clear when and why they return NULL, and whether the return value can be ignored. In particular, "Jumbo regexp patch" (commit c277df42229d99fe, from Nov 1997), added code with two calls from S_reg() to S_regbranch(), one of which checks the return value and generates a LONGJMP node if it returns NULL, the other of which is called in void context, and so both ignores any return value, or the possibility that it is NULL. After some analysis I have untangled the possible return values from these 5 functions (and related functions which call S_reg()). Starting from the top: S_reg() will return NULL and set the flags to TRYAGAIN at the end of pragma- like constructions that it handles. Otherwise, historically it would return NULL if S_regbranch() returned NULL. In turn, S_regbranch() would return NULL if S_regpiece() returned NULL without setting TRYAGAIN. If S_regpiece() returns TRYAGAIN, S_regbranch() loops, and ultimately will not return NULL. S_regpiece() returns NULL with TRYAGAIN if S_regatom() returns NULL with TRYAGAIN, but (historically) if S_regatom() returns NULL without setting the flags to TRYAGAIN, S_regpiece() would to. Where S_regatom() calls S_reg() it has similar behaviour when passing back return values, although often it is able to loop instead on getting a TRYAGAIN. Which gets us back to S_reg(), which can only *generate* NULL in conjunction with TRYAGAIN. NULL without TRYAGAIN could only be returned if a routine it called generated it. All other functions that these call that return regnode structures cannot return NULL. Hence 1) in the loop of functions called, there is no source for a return value of NULL without the TRYAGAIN flag being set 2) a return value of NULL with TRYAGAIN set from an inner function does not propagate out past S_regbranch() Hence the only return values that most functions can generate are non-NULL, or NULL with TRYAGAIN set, and as S_regbranch() catches these, it cannot return NULL. The longest sequence of functions that can return NULL (with TRYAGAIN set) is S_reg() -> S_regatom() -> S_regpiece() -> S_regbranch(). Rapidly returning right round the loop back to S_reg() is not possible. Hence code added by commit c277df42229d99fe to handle a NULL return from S_regbranch(), along with some other code is dead. I have replaced all unreachable code with FAIL()s that panic. M regcomp.c commit 5c9bea1dbbeef502005513dbe1d6e3942dfd3f27 Author: Nicholas Clark <[email protected]> Date: Fri Jan 18 16:30:39 2013 +0100 Return orig_emit from S_regclass() when ret_invlist is true. The return value isn't used (yet). Previously the code was returning END, which is a macro for the regnode number for "End of program" which happens to be 0. It happens that 0 is valid C for a NULL pointer, but using it in this way makes the intent unclear. Not only is orig_emit a valid type, it's semantically the correct thing to return, as it's most recently added node. M regcomp.c commit e22f9b1c6a54c0d78a629d4d4160914b2a563855 Author: Nicholas Clark <[email protected]> Date: Wed Jan 16 21:58:02 2013 +0100 Test that UTF-8 in the look-ahead of (?(?=...)...) restarts the sizing parse. S_reg() recurses to itself to parse various constructions used as the conditionals in conditional matching. Look-aheads and look-behinds can turn out to need to be sized as UTF-8, which can cause the inner S_reg() to use the macro REQUIRE_UTF8 is used to restart the parse. Test that this is handled correctly. M t/re/re_tests commit ba7b73c5937452f0c133ba6521ce46776079f76b Author: Nicholas Clark <[email protected]> Date: Wed Jan 16 17:08:03 2013 +0100 Test that S_grok_bslash_N() copes if S_reg() restarts the sizing parse. S_reg() can discover midway through parsing the pattern to determine its size, that the pattern will actually need to be encoded as UTF-8. If calculations so far have been done in terms of bytes, then the macro REQUIRE_UTF8 is used to restart the parse, so that sizes can be calculated correctly for UTF-8. It is possible to trigger this restart when processing multi-character charnames interpolated into the pattern using \N{}. Test that this is handled correctly. M t/re/pat_advanced.t commit 504858073fe16afb61d66a8b6748851780e51432 Author: Nicholas Clark <[email protected]> Date: Mon Jan 14 09:46:48 2013 +0100 Remove unreachable duplicate (?#...) parsing code from S_reg() I believe that this code was rendered unreachable when perl 5.001 added code to S_nextchar() to skip over embedded comments. Adrian Enache noted this in March 2003, and proposed a patch which removed it. See http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-03/msg00840.html The patch wasn't applied at that time, and when he sent it again August, he omitted that hunk. See http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-08/msg01820.html That version was applied as commit e994fd663a4d8acc. M regcomp.c ----------------------------------------------------------------------- Summary of changes: embed.fnc | 4 +- embed.h | 2 +- pod/perlreguts.pod | 46 ++++++++ proto.h | 4 +- regcomp.c | 301 ++++++++++++++++++++++++++++++--------------------- regcomp.h | 4 +- t/re/pat_advanced.t | 8 ++ t/re/re_tests | 4 + 8 files changed, 242 insertions(+), 131 deletions(-) diff --git a/embed.fnc b/embed.fnc index ba20cf7..6f3adc6 100644 --- a/embed.fnc +++ b/embed.fnc @@ -1106,7 +1106,7 @@ Ap |REGEXP*|pregcomp |NN SV * const pattern|const U32 flags p |REGEXP*|re_op_compile |NULLOK SV ** const patternp \ |int pat_count|NULLOK OP *expr \ |NN const regexp_engine* eng \ - |NULLOK REGEXP *VOL old_re \ + |NULLOK REGEXP *old_re \ |NULLOK bool *is_bare_re \ |U32 rx_flags|U32 pm_flags Ap |REGEXP*|re_compile |NN SV * const pattern|U32 orig_rx_flags @@ -2037,7 +2037,7 @@ EsRn |U32 |add_data |NN struct RExC_state_t *pRExC_state|U32 n \ |NN const char *s rs |void |re_croak2 |NN const char* pat1|NN const char* pat2|... Ei |I32 |regpposixcc |NN struct RExC_state_t *pRExC_state \ - |I32 value|NULLOK SV *free_me|const bool strict + |I32 value|const bool strict Es |I32 |make_trie |NN struct RExC_state_t *pRExC_state \ |NN regnode *startbranch|NN regnode *first \ |NN regnode *last|NN regnode *tail \ diff --git a/embed.h b/embed.h index 7f9be53..f32f446 100644 --- a/embed.h +++ b/embed.h @@ -943,7 +943,7 @@ #define reginsert(a,b,c,d) S_reginsert(aTHX_ a,b,c,d) #define regpatws S_regpatws #define regpiece(a,b,c) S_regpiece(aTHX_ a,b,c) -#define regpposixcc(a,b,c,d) S_regpposixcc(aTHX_ a,b,c,d) +#define regpposixcc(a,b,c) S_regpposixcc(aTHX_ a,b,c) #define regtail(a,b,c,d) S_regtail(aTHX_ a,b,c,d) #define reguni(a,b,c) S_reguni(aTHX_ a,b,c) #define regwhite S_regwhite diff --git a/pod/perlreguts.pod b/pod/perlreguts.pod index fbcb149..bb7f372 100644 --- a/pod/perlreguts.pod +++ b/pod/perlreguts.pod @@ -386,6 +386,52 @@ A grammar form might be something like this: piece : _piece | _piece quant +=head3 Parsing complications + +The implication of the above description is that a pattern containing nested +parentheses will result in a call graph which cycles through C<reg()>, +C<regbranch()>, C<regpiece()>, C<regatom()>, C<reg()>, C<regbranch()> I<etc> +multiple times, until the deepest level of nesting is reached. All the above +routines return a pointer to a C<regnode>, which is usually the last regnode +added to the program. However, one complication is that reg() returns NULL +for parsing C<(?:)> syntax for embedded modifiers, setting the flag +C<TRYAGAIN>. The C<TRYAGAIN> propagates upwards until it is captured, in +some cases by by C<regatom()>, but otherwise unconditionally by +C<regbranch()>. Hence it will never be returned by C<regbranch()> to +C<reg()>. This flag permits patterns such as C<(?i)+> to be detected as +errors (I<Quantifier follows nothing in regex; marked by <-- HERE in m/(?i)+ +<-- HERE />). + +Another complication is that the representation used for the program differs +if it needs to store Unicode, but it's not always possible to know for sure +whether it does until midway through parsing. The Unicode representation for +the program is larger, and cannot be matched as efficiently. (See L</Unicode +and Localisation Support> below for more details as to why.) If the pattern +contains literal Unicode, it's obvious that the program needs to store +Unicode. Otherwise, the parser optimistically assumes that the more +efficient representation can be used, and starts sizing on this basis. +However, if it then encounters something in the pattern which must be stored +as Unicode, such as an C<\x{...}> escape sequence representing a character +literal, then this means that all previously calculated sizes need to be +redone, using values appropriate for the Unicode representation. Currently, +all regular expression constructions which can trigger this are parsed by code +in C<regatom()>. + +To avoid wasted work when a restart is needed, the sizing pass is abandoned +- C<regatom()> immediately returns NULL, setting the flag C<RESTART_UTF8>. +(This action is encapsulated using the macro C<REQUIRE_UTF8>.) This restart +request is propagated up the call chain in a similar fashion, until it is +"caught" in C<Perl_re_op_compile()>, which marks the pattern as containing +Unicode, and restarts the sizing pass. It is also possible for constructions +within run-time code blocks to turn out to need Unicode representation., +which is signalled by C<S_compile_runtime_code()> returning false to +C<Perl_re_op_compile()>. + +The restart was previously implemented using a C<longjmp> in C<regatom()> +back to a C<setjmp> in C<Perl_re_op_compile()>, but this proved to be +problematic as the latter is a large function containing many automatic +variables, which interact badly with the emergent control flow of C<setjmp>. + =head3 Debug Output In the 5.9.x development version of perl you can C<< use re Debug => 'PARSE' >> diff --git a/proto.h b/proto.h index 3d546a0..52f2098 100644 --- a/proto.h +++ b/proto.h @@ -3263,7 +3263,7 @@ PERL_CALLCONV SV* Perl_re_intuit_string(pTHX_ REGEXP *const r) #define PERL_ARGS_ASSERT_RE_INTUIT_STRING \ assert(r) -PERL_CALLCONV REGEXP* Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, OP *expr, const regexp_engine* eng, REGEXP *VOL old_re, bool *is_bare_re, U32 rx_flags, U32 pm_flags) +PERL_CALLCONV REGEXP* Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, OP *expr, const regexp_engine* eng, REGEXP *old_re, bool *is_bare_re, U32 rx_flags, U32 pm_flags) __attribute__nonnull__(pTHX_4); #define PERL_ARGS_ASSERT_RE_OP_COMPILE \ assert(eng) @@ -6711,7 +6711,7 @@ STATIC regnode* S_regpiece(pTHX_ struct RExC_state_t *pRExC_state, I32 *flagp, U #define PERL_ARGS_ASSERT_REGPIECE \ assert(pRExC_state); assert(flagp) -PERL_STATIC_INLINE I32 S_regpposixcc(pTHX_ struct RExC_state_t *pRExC_state, I32 value, SV *free_me, const bool strict) +PERL_STATIC_INLINE I32 S_regpposixcc(pTHX_ struct RExC_state_t *pRExC_state, I32 value, const bool strict) __attribute__nonnull__(pTHX_1); #define PERL_ARGS_ASSERT_REGPPOSIXCC \ assert(pRExC_state) diff --git a/regcomp.c b/regcomp.c index 29434b9..316c4ee 100644 --- a/regcomp.c +++ b/regcomp.c @@ -231,8 +231,9 @@ typedef struct RExC_state_t { * REGNODE_SIMPLE */ #define SIMPLE 0x02 #define SPSTART 0x04 /* Starts with * or + */ -#define TRYAGAIN 0x08 /* Weeded out a declaration. */ -#define POSTPONED 0x10 /* (?1),(?&name), (??{...}) or similar */ +#define POSTPONED 0x08 /* (?1),(?&name), (??{...}) or similar */ +#define TRYAGAIN 0x10 /* Weeded out a declaration. */ +#define RESTART_UTF8 0x20 /* Restart, need to calcuate sizes as UTF-8 */ #define REG_NODE_NUM(x) ((x) ? (int)((x)-RExC_emit_start) : -1) @@ -251,10 +252,11 @@ typedef struct RExC_state_t { #define PAREN_SET(u8str,paren) PBYTE(u8str,paren) |= PBITVAL(paren) #define PAREN_UNSET(u8str,paren) PBYTE(u8str,paren) &= (~PBITVAL(paren)) -/* If not already in utf8, do a longjmp back to the beginning */ -#define UTF8_LONGJMP 42 /* Choose a value not likely to ever conflict */ #define REQUIRE_UTF8 STMT_START { \ - if (! UTF) JMPENV_JUMP(UTF8_LONGJMP); \ + if (!UTF) { \ + *flagp = RESTART_UTF8; \ + return NULL; \ + } \ } STMT_END /* This converts the named class defined in regcomp.h to its equivalent class @@ -5196,7 +5198,7 @@ S_setup_longest(pTHX_ RExC_state_t *pRExC_state, SV* sv_longest, SV** rx_utf8, S REGEXP * Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, - OP *expr, const regexp_engine* eng, REGEXP *VOL old_re, + OP *expr, const regexp_engine* eng, REGEXP *old_re, bool *is_bare_re, U32 orig_rx_flags, U32 pm_flags) { dVAR; @@ -5204,32 +5206,29 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, struct regexp *r; regexp_internal *ri; STRLEN plen; - char * VOL exp; + char *exp; char* xend; regnode *scan; I32 flags; I32 minlen = 0; U32 rx_flags; - SV * VOL pat; - SV * VOL code_blocksv = NULL; + SV *pat; + SV *code_blocksv = NULL; /* these are all flags - maybe they should be turned * into a single int with different bit masks */ I32 sawlookahead = 0; I32 sawplus = 0; I32 sawopen = 0; - bool used_setjump = FALSE; regex_charset initial_charset = get_regex_charset(orig_rx_flags); bool code_is_utf8 = 0; - bool VOL recompile = 0; + bool recompile = 0; bool runtime_code = 0; - U8 jump_ret = 0; - dJMPENV; scan_data_t data; RExC_state_t RExC_state; RExC_state_t * const pRExC_state = &RExC_state; #ifdef TRIE_STUDY_OPT - int restudied; + int restudied = 0; RExC_state_t copyRExC_state; #endif GET_RE_DEBUG_FLAGS_DECL; @@ -5554,6 +5553,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, } exp = SvPV_nomg(pat, plen); + xend = exp + plen; if (!eng->op_comp) { if ((SvUTF8(pat) && IN_BYTES) @@ -5574,39 +5574,22 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, RExC_contains_locale = 0; pRExC_state->runtime_code_qr = NULL; - /****************** LONG JUMP TARGET HERE***********************/ - /* Longjmp back to here if have to switch in midstream to utf8 */ - if (! RExC_orig_utf8) { - JMPENV_PUSH(jump_ret); - used_setjump = TRUE; - } - - if (jump_ret == 0) { /* First time through */ - xend = exp + plen; - - DEBUG_COMPILE_r({ + DEBUG_COMPILE_r({ SV *dsv= sv_newmortal(); - RE_PV_QUOTED_DECL(s, RExC_utf8, - dsv, exp, plen, 60); + RE_PV_QUOTED_DECL(s, RExC_utf8, dsv, exp, plen, 60); PerlIO_printf(Perl_debug_log, "%sCompiling REx%s %s\n", - PL_colors[4],PL_colors[5],s); + PL_colors[4],PL_colors[5],s); }); - } - else { /* longjumped back */ - U8 *src, *dst; + + if (0) { + redo_first_pass: + { + U8 *const src = (U8*)exp; + U8 *dst; int n=0; STRLEN s = 0, d = 0; bool do_end = 0; - /* If the cause for the longjmp was other than changing to utf8, pop - * our own setjmp, and longjmp to the correct handler */ - if (jump_ret != UTF8_LONGJMP) { - JMPENV_POP; - JMPENV_JUMP(jump_ret); - } - - GET_RE_DEBUG_FLAGS; - /* It's possible to write a regexp in ascii that represents Unicode codepoints outside of the byte range, such as via \x{100}. If we detect such a sequence we have to convert the entire pattern to utf8 @@ -5622,7 +5605,6 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, * recalculate the indices. * This is essentially an unrolled Perl_bytes_to_utf8() */ - src = (U8*)SvPV_nomg(pat, plen); Newx(dst, plen * 2 + 1, U8); while (s < plen) { @@ -5655,6 +5637,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, xend = exp + plen; SAVEFREEPV(exp); RExC_orig_utf8 = RExC_utf8 = 1; + } } /* return old regex if pattern hasn't changed */ @@ -5670,9 +5653,6 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, runtime_code = S_has_runtime_code(aTHX_ pRExC_state, expr, pm_flags, exp, plen); if (!runtime_code) { - if (used_setjump) { - JMPENV_POP; - } Safefree(pRExC_state->code_blocks); return old_re; } @@ -5686,10 +5666,6 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, runtime_code = S_has_runtime_code(aTHX_ pRExC_state, expr, pm_flags, exp, plen); -#ifdef TRIE_STUDY_OPT - restudied = 0; -#endif - rx_flags = orig_rx_flags; if (initial_charset == REGEX_LOCALE_CHARSET) { @@ -5713,7 +5689,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, if (!S_compile_runtime_code(aTHX_ pRExC_state, exp, plen)) { /* whoops, we have a non-utf8 pattern, whilst run-time code * got compiled as utf8. Try again with a utf8 pattern */ - JMPENV_JUMP(UTF8_LONGJMP); + goto redo_first_pass; } } assert(!pRExC_state->runtime_code_qr); @@ -5761,7 +5737,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, pRExC_state->code_blocks. We cannot SAVEFREEPV it now, as we may need it to survive as long as the regexp (qr/(?{})/). We must check that code_blocksv is not already set, because we may - have longjmped back. */ + have jumped back to restart the sizing pass. */ if (pRExC_state->code_blocks && !code_blocksv) { code_blocksv = newSV_type(SVt_PV); SAVEFREESV(code_blocksv); @@ -5769,17 +5745,14 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, SvLEN_set(code_blocksv, 1); /*sufficient to make sv_clear free it*/ } if (reg(pRExC_state, 0, &flags,1) == NULL) { - RExC_precomp = NULL; - return(NULL); + if (flags & RESTART_UTF8) { + goto redo_first_pass; + } + Perl_croak(aTHX_ "panic: reg returned NULL to re_op_compile for sizing pass, flags=%#X", flags); } if (code_blocksv) SvLEN_set(code_blocksv,0); /* no you can't have it, sv_clear */ - /* Here, finished first pass. Get rid of any added setjmp */ - if (used_setjump) { - JMPENV_POP; - } - DEBUG_PARSE_r({ PerlIO_printf(Perl_debug_log, "Required size %"IVdf" nodes\n" @@ -5945,7 +5918,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, REGC((U8)REG_MAGIC, (char*) RExC_emit++); if (reg(pRExC_state, 0, &flags,1) == NULL) { ReREFCNT_dec(rx); - return(NULL); + Perl_croak(aTHX_ "panic: reg returned NULL to re_op_compile for generation pass, flags=%#X", flags); } /* XXXX To minimize changes to RE engine we always allocate 3-units-long substrs field. */ @@ -8555,6 +8528,11 @@ S_parse_lparen_question_flags(pTHX_ struct RExC_state_t *pRExC_state) #define REGTAIL_STUDY(x,y,z) regtail((x),(y),(z),depth+1) #endif +/* Returns NULL, setting *flagp to TRYAGAIN at the end of (?) that only sets + flags. Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan + needs to be restarted. + Otherwise would only return NULL if regbranch() returns NULL, which + cannot happen. */ STATIC regnode * S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) /* paren: Parenthesized? 0=top, 1=(, inside: changed to letter. */ @@ -8854,14 +8832,6 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) case '@': /* (?@...) */ vFAIL2("Sequence (?%c...) not implemented", (int)paren); break; - case '#': /* (?#...) */ - while (*RExC_parse && *RExC_parse != ')') - RExC_parse++; - if (*RExC_parse != ')') - FAIL("Sequence (?#... not terminated"); - nextchar(pRExC_state); - *flagp = TRYAGAIN; - return NULL; case '0' : /* (?0) */ case 'R' : /* (?R) */ if (*RExC_parse != ')') @@ -9027,11 +8997,18 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) || RExC_parse[1] == '<' || RExC_parse[1] == '{') { /* Lookahead or eval. */ I32 flag; + regnode *tail; ret = reg_node(pRExC_state, LOGICAL); if (!SIZE_ONLY) ret->flags = 1; - REGTAIL(pRExC_state, ret, reg(pRExC_state, 1, &flag,depth+1)); + + tail = reg(pRExC_state, 1, &flag, depth+1); + if (flag & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + REGTAIL(pRExC_state, ret, tail); goto insert_if; } } @@ -9099,9 +9076,14 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) insert_if: REGTAIL(pRExC_state, ret, reganode(pRExC_state, IFTHEN, 0)); br = regbranch(pRExC_state, &flags, 1,depth+1); - if (br == NULL) - br = reganode(pRExC_state, LONGJMP, 0); - else + if (br == NULL) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regbranch returned NULL, flags=%#X", + flags); + } else REGTAIL(pRExC_state, br, reganode(pRExC_state, LONGJMP, 0)); c = *nextchar(pRExC_state); if (flags&HASWIDTH) @@ -9110,7 +9092,14 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) if (is_define) vFAIL("(?(DEFINE)....) does not allow branches"); lastbr = reganode(pRExC_state, IFTHEN, 0); /* Fake one for optimizer. */ - regbranch(pRExC_state, &flags, 1,depth+1); + if (!regbranch(pRExC_state, &flags, 1,depth+1)) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regbranch returned NULL, flags=%#X", + flags); + } REGTAIL(pRExC_state, ret, lastbr); if (flags&HASWIDTH) *flagp |= HASWIDTH; @@ -9192,8 +9181,13 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) /* branch_len = (paren != 0); */ - if (br == NULL) - return(NULL); + if (br == NULL) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regbranch returned NULL, flags=%#X", flags); + } if (*RExC_parse == '|') { if (!SIZE_ONLY && RExC_extralen) { reginsert(pRExC_state, BRANCHJ, br, depth+1); @@ -9232,8 +9226,13 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) } br = regbranch(pRExC_state, &flags, 0, depth+1); - if (br == NULL) - return(NULL); + if (br == NULL) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regbranch returned NULL, flags=%#X", flags); + } REGTAIL(pRExC_state, lastbr, br); /* BRANCH -> BRANCH. */ lastbr = br; *flagp |= flags & (SPSTART | HASWIDTH | POSTPONED); @@ -9390,6 +9389,9 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) - regbranch - one alternative of an | operator * * Implements the concatenation operator. + * + * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be + * restarted. */ STATIC regnode * S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth) @@ -9429,7 +9431,11 @@ S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth) if (latest == NULL) { if (flags & TRYAGAIN) continue; - return(NULL); + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regpiece returned NULL, flags=%#X", flags); } else if (ret == NULL) ret = latest; @@ -9463,6 +9469,11 @@ S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth) * both the endmarker for their branch list and the body of the last branch. * It might seem that this node could be dispensed with entirely, but the * endmarker role is not redundant. + * + * Returns NULL, setting *flagp to TRYAGAIN if regatom() returns NULL with + * TRYAGAIN. + * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be + * restarted. */ STATIC regnode * S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) @@ -9491,8 +9502,10 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) ret = regatom(pRExC_state, &flags,depth+1); if (ret == NULL) { - if (flags & TRYAGAIN) - *flagp |= TRYAGAIN; + if (flags & (TRYAGAIN|RESTART_UTF8)) + *flagp |= flags & (TRYAGAIN|RESTART_UTF8); + else + FAIL2("panic: regatom returned NULL, flags=%#X", flags); return(NULL); } @@ -9722,7 +9735,10 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state, regnode** node_p, UV *valuep, I The function raises an error (via vFAIL), and doesn't return for various syntax errors. Otherwise it returns TRUE and sets <node_p> or <valuep> on - success; it returns FALSE otherwise. + success; it returns FALSE otherwise. Returns FALSE, setting *flagp to + RESTART_UTF8 if the sizing scan needs to be restarted. Such a restart is + only possible if node_p is non-NULL. + If <valuep> is non-null, it means the caller can accept an input sequence consisting of a just a single code point; <*valuep> is set to that value @@ -9926,7 +9942,14 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state, regnode** node_p, UV *valuep, I /* The values are Unicode, and therefore not subject to recoding */ RExC_override_recoding = 1; - *node_p = reg(pRExC_state, 1, &flags, depth+1); + if (!(*node_p = reg(pRExC_state, 1, &flags, depth+1))) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return FALSE; + } + FAIL2("panic: reg returned NULL to grok_bslash_N, flags=%#X", + flags); + } *flagp |= flags&(HASWIDTH|SPSTART|SIMPLE|POSTPONED); RExC_parse = endbrace; @@ -10126,6 +10149,12 @@ S_alloc_maybe_populate_EXACT(pTHX_ RExC_state_t *pRExC_state, regnode *node, I32 escape sequences, with the one for handling literal escapes requiring a dummy entry for all of the special escapes that are actually handled by the other. + + Returns NULL, setting *flagp to TRYAGAIN if reg() returns NULL with + TRYAGAIN. + Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be + restarted. + Otherwise does not return NULL. */ STATIC regnode * @@ -10193,6 +10222,12 @@ tryagain: RExC_parse = oregcomp_parse; vFAIL("Unmatched ["); } + if (ret == NULL) { + if (*flagp & RESTART_UTF8) + return NULL; + FAIL2("panic: regclass returned NULL to regatom, flags=%#X", + *flagp); + } nextchar(pRExC_state); Set_Node_Length(ret, RExC_parse - oregcomp_parse + 1); /* MJD */ break; @@ -10209,7 +10244,11 @@ tryagain: } goto tryagain; } - return(NULL); + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: reg returned NULL to regatom, flags=%#X", flags); } *flagp |= flags&(HASWIDTH|SPSTART|SIMPLE|POSTPONED); break; @@ -10401,6 +10440,11 @@ tryagain: It would be a bug if these returned non-portables */ NULL); + /* regclass() can only return RESTART_UTF8 if multi-char folds + are allowed. */ + if (!ret) + FAIL2("panic: regclass returned NULL to regatom, flags=%#X", + *flagp); RExC_parse--; @@ -10423,6 +10467,8 @@ tryagain: ++RExC_parse; if (! grok_bslash_N(pRExC_state, &ret, NULL, flagp, depth, FALSE, FALSE /* not strict */ )) { + if (*flagp & RESTART_UTF8) + return NULL; RExC_parse--; goto defchar; } @@ -10689,6 +10735,8 @@ tryagain: flagp, depth, FALSE, FALSE /* not strict */ )) { + if (*flagp & RESTART_UTF8) + FAIL("panic: grok_bslash_N set RESTART_UTF8"); RExC_parse = p = oldp; goto loopdone; } @@ -11247,8 +11295,7 @@ S_regpatws( RExC_state_t *pRExC_state, char *p , const bool recognize_comment ) #define POSIXCC(c) (POSIXCC_DONE(c) || POSIXCC_NOTYET(c)) PERL_STATIC_INLINE I32 -S_regpposixcc(pTHX_ RExC_state_t *pRExC_state, I32 value, SV *free_me, - const bool strict) +S_regpposixcc(pTHX_ RExC_state_t *pRExC_state, I32 value, const bool strict) { dVAR; I32 namedclass = OOB_NAMEDCLASS; @@ -11372,7 +11419,6 @@ S_regpposixcc(pTHX_ RExC_state_t *pRExC_state, I32 value, SV *free_me, the class closes */ while (UCHARAT(RExC_parse) && UCHARAT(RExC_parse) != ']') RExC_parse++; - SvREFCNT_dec(free_me); vFAIL3("POSIX syntax [%c %c] is reserved for future extensions", c, c); } } else { @@ -11512,13 +11558,18 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, I32 *f RExC_parse++; } - (void) regclass(pRExC_state, flagp,depth+1, - is_posix_class, /* parse the whole char - class only if not a - posix class */ - FALSE, /* don't allow multi-char folds */ - TRUE, /* silence non-portable warnings. */ - ¤t); + /* regclass() can only return RESTART_UTF8 if multi-char + folds are allowed. */ + if (!regclass(pRExC_state, flagp,depth+1, + is_posix_class, /* parse the whole char + class only if not a + posix class */ + FALSE, /* don't allow multi-char folds */ + TRUE, /* silence non-portable warnings. */ + ¤t)) + FAIL2("panic: regclass returned NULL to handle_sets, flags=%#X", + *flagp); + /* function call leaves parse pointing to the ']', except * if we faked it */ if (is_posix_class) { @@ -11675,12 +11726,15 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, I32 *f vFAIL("Unexpected character"); case '\\': - (void) regclass(pRExC_state, flagp,depth+1, - TRUE, /* means parse just the next thing */ - FALSE, /* don't allow multi-char folds */ - FALSE, /* don't silence non-portable warnings. - */ - ¤t); + /* regclass() can only return RESTART_UTF8 if multi-char + folds are allowed. */ + if (!regclass(pRExC_state, flagp,depth+1, + TRUE, /* means parse just the next thing */ + FALSE, /* don't allow multi-char folds */ + FALSE, /* don't silence non-portable warnings. */ + ¤t)) + FAIL2("panic: regclass returned NULL to handle_sets, flags=%#X", + *flagp); /* regclass() will return with parsing just the \ sequence, * leaving the parse pointer at the next thing to parse */ RExC_parse--; @@ -11694,13 +11748,16 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, I32 *f RExC_parse++; } - (void) regclass(pRExC_state, flagp,depth+1, - is_posix_class, /* parse the whole char class - only if not a posix class */ - FALSE, /* don't allow multi-char folds */ - FALSE, /* don't silence non-portable warnings. - */ - ¤t); + /* regclass() can only return RESTART_UTF8 if multi-char + folds are allowed. */ + if(!regclass(pRExC_state, flagp,depth+1, + is_posix_class, /* parse the whole char class + only if not a posix class */ + FALSE, /* don't allow multi-char folds */ + FALSE, /* don't silence non-portable warnings. */ + ¤t)) + FAIL2("panic: regclass returned NULL to handle_sets, flags=%#X", + *flagp); /* function call leaves parse pointing to the ']', except if we * faked it */ if (is_posix_class) { @@ -11880,6 +11937,8 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, I32 *f * already has all folding taken into consideration, and we don't want * regclass() to add to that */ RExC_flags &= ~RXf_PMf_FOLD; + /* regclass() can only return RESTART_UTF8 if multi-char folds are allowed. + */ node = regclass(pRExC_state, flagp,depth+1, FALSE, /* means parse the whole char class */ FALSE, /* don't allow multi-char folds */ @@ -11887,6 +11946,8 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, I32 *f well have generated non-portable code points, but they're valid on this machine */ NULL); + if (!node) + FAIL2("panic: regclass returned NULL to handle_sets, flags=%#X", flagp); if (save_fold) { RExC_flags |= RXf_PMf_FOLD; } @@ -11936,7 +11997,11 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, * corresponding bit set if that character is in the list. For characters * above 255, a range list or swash is used. There are extra bits for \w, * etc. in locale ANYOFs, as what these match is not determinable at - * compile time */ + * compile time + * + * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs + * to be restarted. This can only happen if ret_invlist is non-NULL. + */ dVAR; UV prevvalue = OOB_UNICODE, save_prevvalue = OOB_UNICODE; @@ -12022,8 +12087,9 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, if (LOC) { ANYOF_FLAGS(ret) |= ANYOF_LOCALE; } - listsv = newSVpvs("# comment\n"); + listsv = newSVpvs_flags("# comment\n", SVs_TEMP); initial_listsv_len = SvCUR(listsv); + SvTEMP_off(listsv); /* Grr, TEMPs and mortals are conflated. */ } if (skip_white) { @@ -12051,12 +12117,10 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, s++; if (*s && c == *s && s[1] == ']') { SAVEFREESV(RExC_rx_sv); - SAVEFREESV(listsv); ckWARN3reg(s+2, "POSIX syntax [%c %c] belongs inside character classes", c, c); (void)ReREFCNT_inc(RExC_rx_sv); - SvREFCNT_inc_simple_void_NN(listsv); } } @@ -12108,7 +12172,7 @@ parseit: && RExC_parse < RExC_end && POSIXCC(UCHARAT(RExC_parse))) { - namedclass = regpposixcc(pRExC_state, value, listsv, strict); + namedclass = regpposixcc(pRExC_state, value, strict); } else if (value == '\\') { if (UTF) { @@ -12153,6 +12217,8 @@ parseit: TRUE, /* => charclass */ strict)) { + if (*flagp & RESTART_UTF8) + FAIL("panic: grok_bslash_N set RESTART_UTF8"); goto parseit; } } @@ -12346,7 +12412,6 @@ parseit: value = grok_oct(--RExC_parse, &numlen, &flags, NULL); RExC_parse += numlen; if (numlen != 3) { - SAVEFREESV(listsv); /* In case warnings are fatalized */ if (strict) { RExC_parse += (UTF) ? UTF8SKIP(RExC_parse) : 1; vFAIL("Need exactly 3 octal digits"); @@ -12363,7 +12428,6 @@ parseit: form_short_octal_warning(RExC_parse, numlen)); (void)ReREFCNT_inc(RExC_rx_sv); } - SvREFCNT_inc_simple_void_NN(listsv); } if (PL_encoding && value < 0x100) goto recode_encoding; @@ -12387,7 +12451,6 @@ parseit: default: /* Allow \_ to not give an error */ if (!SIZE_ONLY && isWORDCHAR(value) && value != '_') { - SAVEFREESV(listsv); if (strict) { vFAIL2("Unrecognized escape \\%c in character class", (int)value); @@ -12399,7 +12462,6 @@ parseit: (int)value); (void)ReREFCNT_inc(RExC_rx_sv); } - SvREFCNT_inc_simple_void_NN(listsv); } break; } /* End of switch on char following backslash */ @@ -12445,7 +12507,6 @@ parseit: const int w = (RExC_parse >= rangebegin) ? RExC_parse - rangebegin : 0; - SAVEFREESV(listsv); /* in case of fatal warnings */ if (strict) { vFAIL4("False [] range \"%*.*s\"", w, w, rangebegin); } @@ -12458,7 +12519,6 @@ parseit: cp_list = add_cp_to_invlist(cp_list, '-'); cp_list = add_cp_to_invlist(cp_list, prevvalue); } - SvREFCNT_inc_simple_void_NN(listsv); } range = 0; /* this was not a true range */ @@ -12994,13 +13054,12 @@ parseit: ret = reg(pRExC_state, 1, ®_flags, depth+1); - *flagp |= reg_flags&(HASWIDTH|SIMPLE|SPSTART|POSTPONED); + *flagp |= reg_flags&(HASWIDTH|SIMPLE|SPSTART|POSTPONED|RESTART_UTF8); RExC_parse = save_parse; RExC_end = save_end; RExC_in_multi_char_class = 0; SvREFCNT_dec_NN(multi_char_matches); - SvREFCNT_dec_NN(listsv); return ret; } @@ -13158,7 +13217,6 @@ parseit: RExC_parse = (char *) cur_parse; SvREFCNT_dec(posixes); - SvREFCNT_dec_NN(listsv); SvREFCNT_dec(cp_list); return ret; } @@ -13526,7 +13584,7 @@ parseit: else { RExC_emit = orig_emit; } - return END; + return orig_emit; } /* If we didn't do folding, it's because some information isn't available @@ -13660,7 +13718,6 @@ parseit: } SvREFCNT_dec_NN(cp_list); - SvREFCNT_dec_NN(listsv); return ret; } } @@ -13748,7 +13805,6 @@ parseit: && ! HAS_NONLOCALE_RUNTIME_PROPERTY_DEFINITION) { ARG_SET(ret, ANYOF_NONBITMAP_EMPTY); - SvREFCNT_dec_NN(listsv); } else { /* av[0] stores the character class description in its textual form: @@ -13765,8 +13821,7 @@ parseit: SV *rv; av_store(av, 0, (HAS_NONLOCALE_RUNTIME_PROPERTY_DEFINITION) - ? listsv - : (SvREFCNT_dec_NN(listsv), &PL_sv_undef)); + ? SvREFCNT_inc(listsv) : &PL_sv_undef); if (swash) { av_store(av, 1, swash); SvREFCNT_dec_NN(cp_list); diff --git a/regcomp.h b/regcomp.h index 9607326..21d2e1e 100644 --- a/regcomp.h +++ b/regcomp.h @@ -829,11 +829,9 @@ re.pm, especially to the documentation. if (re_debug_flags & RE_DEBUG_EXTRA_GPOS) x ) /* initialization */ -/* get_sv() can return NULL during global destruction. re_debug_flags can get - * clobbered by a longjmp, so must be initialized */ +/* get_sv() can return NULL during global destruction. */ #define GET_RE_DEBUG_FLAGS DEBUG_r({ \ SV * re_debug_flags_sv = NULL; \ - re_debug_flags = 0; \ re_debug_flags_sv = get_sv(RE_DEBUG_FLAGS, 1); \ if (re_debug_flags_sv) { \ if (!SvIOK(re_debug_flags_sv)) \ diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t index fc13c6f..faa8859 100644 --- a/t/re/pat_advanced.t +++ b/t/re/pat_advanced.t @@ -1082,6 +1082,14 @@ sub run_tests { eval "q(W) =~ /\\N{$name}/"; ok ! $w, 'Verify that latin1 letter in name doesnt give warning'; + # This tests the code path that restarts the parse when the recursive + # call to S_reg() from within S_grok_bslash_N() discovers that the + # pattern needs to be recalculated as UTF-8. use eval to avoid + # needing literal Unicode in this source file: + my $r = eval "qr/\\N{\x{100}\x{100}}/"; + isnt $r, undef, "Generated regex for multi-char UTF-8 charname" + or diag($@); + ok "\x{100}\x{100}" =~ $r, "which matches"; } { diff --git a/t/re/re_tests b/t/re/re_tests index c41d529..7e7fc85 100644 --- a/t/re/re_tests +++ b/t/re/re_tests @@ -592,6 +592,10 @@ x(~~)*(?:(?:F)?)? x~~ y - - (?(?!a)b|a) a y $& a (?(?=a)b|a) a n - - (?(?=a)a|b) a y $& a +(?(?!\x{100})\x{100}|b) \x{100} n - - +(?(?!\x{100})b|\x{100}) \x{100} y $& \x{100} +(?(?=\x{100})b|\x{100}) \x{100} n - - +(?(?=\x{100})\x{100}|b) \x{100} y $& \x{100} (?=(a+?))(\1ab) aaab y $2 aab ^(?=(a+?))\1ab aaab n - - (\w+:)+ one: y $1 one: -- Perl5 Master Repository
