In perl.git, the branch smoke-me/remove-regcomp-setjmp has been updated <http://perl5.git.perl.org/perl.git/commitdiff/810f81c8c1730e648f9fc495bd81de8a639c675c?hp=c10ee8c658902092ff2f64ca6058b98ba96b6920>
- Log ----------------------------------------------------------------- commit 810f81c8c1730e648f9fc495bd81de8a639c675c Author: Nicholas Clark <[email protected]> Date: Mon Jan 21 20:58:49 2013 +0100 Revert "PATCH: regex longjmp flaws" This reverts commit 595598ee1f247e72e06e4cfbe0f98406015df5cc. The netbsd - 5.0.2 compiler pointed out that the recent changes to add longjmps to speed up some regex compilations can result in clobbering a few values. These depend on the compiled code, and so didn't show up in other compiler's warnings. This patch reinitializes them after a longjmp. [With a lot of hand editing in regcomp.c, to propagate the changes through subsequent commits.] M regcomp.c M regcomp.h commit 8bc1a6c9dee5e5ac3d374ca654efba63647b6910 Author: Nicholas Clark <[email protected]> Date: Mon Jan 21 20:32:01 2013 +0100 In Perl_re_op_compile(), tidy up after removing setjmp(). Remove volatile qualifiers. Remove the variable jump_ret. Move the initialisation of restudied back to the declaration. This reverts several of the changes made by commits 5d51ce98fae3de07 and bbd61b5ffb7621c2. However, I can't see a cleaner way to avoid code duplication when restarting the parse than to approach I've taken here - the label redo_first_pass is now inside an if (0) block, which is clear but ugly. M embed.fnc M proto.h M regcomp.c commit dcaf3459f8ce6b3fd956132377e8ad41b9d49dd8 Author: Nicholas Clark <[email protected]> Date: Mon Jan 21 17:15:30 2013 +0100 Replace the longjmp()s in Perl_re_op_compile() with goto. The regex parse needs to be restarted if it turns out that it should be done as UTF-8, not bytes. Using setjmp()/longjmp() complicates compilation considerably, causing warnings about missing use of volatile, and hitting code generation errors from clang's ASAN. Using goto is much clearer. M regcomp.c commit 49115de05eb16e7f65afab7e160e15334dfccbd4 Author: Nicholas Clark <[email protected]> Date: Sat Jan 19 11:06:10 2013 +0100 Move the longjmp() that implements REQUIRE_UTF8 up to Perl_re_op_compile(). With longjmp() and setjmp() now in the same function (and all tests passing), it becomes easy to replace the pair with a goto. Still evil, but "the lesser of two evils". M regcomp.c commit da2d0109e35d7269445b5199f6f4d10abba65644 Author: Nicholas Clark <[email protected]> Date: Fri Jan 18 17:21:03 2013 +0100 Add a flag RESTART_UTF8 to the reg*() routines in regcomp.c Add a flag RESTART_UTF8 along with infrastructure to the reg*() routines to permit the parse to be restarted without using longjmp(). However, it's not used yet. M regcomp.c commit ba7cf480dd211e0e7572eb612e252b3ca3819980 Author: Nicholas Clark <[email protected]> Date: Fri Jan 18 11:32:44 2013 +0100 In S_regclass(), create listsv as a mortal, claiming a reference if needed. The SV listsv is sometimes stored in an array generated near the end of S_regclass(). In other cases it is not used, and it needs to be freed if any of the warnings that S_regclass() can trigger turn out to be fatal. The simplest solution to this problem is to declare it from the start as a mortal, and claim a (new) reference to it if it is *not* to be freed. This permits the removal of all other code related to ensuring that it is freed at the right time, but not freed prematurely if a call to a warning returns. M embed.fnc M embed.h M proto.h M regcomp.c commit e6731cc09b686669ef22a09a2911891904b62516 Author: Nicholas Clark <[email protected]> Date: Thu Jan 17 11:47:13 2013 +0100 Document when and why S_reg{,branch,piece,atom,class}() return NULL. As documented in pod/perlreguts.pod, the call graph for regex parsing involves several levels of functions in regcomp.c, sometimes recursing more than once. The top level compiling function, S_reg(), calls S_regbranch() to parse each single branch of an alternation. In turn, that calls S_regpiece() to parse a simple pattern followed by quantifier, which calls S_regatom() to parse that simple pattern. S_regatom() can call S_regclass() to handle classes, but can also recurse into S_reg() to handle subpatterns and some other constructions. Some other routines call call S_reg(), sometimes using an alternative pattern that they generate dynamically to represent their input. These routines all return a pointer to a regnode structure, and take a pointer to an integer that holds flags, which is also used to return information. Historically, it has not been clear when and why they return NULL, and whether the return value can be ignored. In particular, "Jumbo regexp patch" (commit c277df42229d99fe, from Nov 1997), added code with two calls from S_reg() to S_regbranch(), one of which checks the return value and generates a LONGJMP node if it returns NULL, the other of which is called in void context, and so both ignores any return value, or the possibility that it is NULL. After some analysis I have untangled the possible return values from these 5 functions (and related functions which call S_reg()). Starting from the top: S_reg() will return NULL and set the flags to TRYAGAIN at the end of pragma- line constructions that it handles. Otherwise, historically it would return NULL if S_regbranch() returned NULL. In turn, S_regbranch() would return NULL if S_regpiece() returned NULL without setting TRYAGAIN. If S_regpiece() returns TRYAGAIN, S_regbranch() loops, and ultimately will not return NULL. S_regpiece() returns NULL with TRYAGAIN if S_regatom() returns NULL with TRYAGAIN, but (historically) if S_regatom() returns NULL without setting the flags to TRYAGAIN, S_regpiece() would to. Where S_regatom() calls S_reg() it has similar behaviour when passing back return values, although often it is able to loop instead on getting a TRYAGAIN. Which gets us back to S_reg(), which can only *generate* NULL in conjunction with TRYAGAIN. NULL without TRYAGAIN could only be returned if a routine it called generated it. All other functions that these call that return regnode structures cannot return NULL. Hence 1) in the loop of functions called, there is no source for a return value of NULL without the TRYAGAIN flag being set 2) a return value of NULL with TRYAGAIN set from an inner function does not propagate out past S_regbranch() Hence the only return values that most functions can generate are non-NULL, or NULL with TRYAGAIN set, and as S_regbranch() catches these, it cannot return NULL. The longest sequence of functions that can return NULL (with TRYAGAIN set) is S_reg() -> S_regatom() -> S_regpiece() -> S_regbranch(). Rapidly returning right round the loop back to S_reg() is not possible. Hence code added by commit c277df42229d99fe to handle a NULL return from S_regbranch(), along with some other code is dead. I have replaced all unreachable code with FAIL()s that panic. M regcomp.c commit bbc20ad312647f0e1e3c57f1f8fa7ceb67b01ce9 Author: Nicholas Clark <[email protected]> Date: Fri Jan 18 16:30:39 2013 +0100 Return orig_emit from S_regclass() when ret_invlist is true. The return value isn't used (yet). Previously the code was returning END, which is a macro for the regnode number for "End of program" which happens to be 0. It happens that 0 is valid C for a NULL pointer, but using it in this way makes the intent unclear. Not only is orig_emit a valid type, it's semantically the correct thing to return, as it's most recently added node. M regcomp.c commit 6fbf949a3ab4df75eaea72f38e8adc688b3749f9 Author: Nicholas Clark <[email protected]> Date: Wed Jan 16 21:58:02 2013 +0100 Test that UTF-8 in the look-ahead of (?(?=...)...) restarts the sizing parse. S_reg() recurses to itself to parse various constructions used as the conditionals in conditional matching. Look-aheads and look-behinds can turn out to need to be sized as UTF-8, which can cause the inner S_reg() to use the macro REQUIRE_UTF8 is used to restart the parse. Test that this is handled correctly. M t/re/re_tests commit 3d639504eb1c440ce4affb35dee21989b241771b Author: Nicholas Clark <[email protected]> Date: Wed Jan 16 17:08:03 2013 +0100 Test that S_grok_bslash_N() copes if S_reg() restarts the sizing parse. S_reg() can discover midway through parsing the pattern to determine its size, that the pattern will actually need to be encoded as UTF-8. If calculations so far have been done in terms of bytes, then the macro REQUIRE_UTF8 is used to restart the parse, so that sizes can be calculated correctly for UTF-8. It is possible to trigger this restart when processing multi-character charnames interpolated into the pattern using \N{}. Test that this is handled correctly. M t/re/pat_advanced.t commit ee9c233b469f709a1962d7bdff4d21e5231b9aee Author: Nicholas Clark <[email protected]> Date: Wed Jan 16 11:48:04 2013 +0100 Perl_sv_uni_display() needs to be aware of RX_WRAPPED() Commit 8d919b0a35f2b57a changed the storage location of the string in SVt_REGEXP. It updated most code to deal with this, but missed the use of SvPVX_const() in Perl_sv_uni_display(). This breaks dumping regular expressions which have the UTF-8 flag set. M ext/Devel-Peek/t/Peek.t M utf8.c commit 60c42c5fa66c9e9ebfc683f2b3a6fa0261d1585a Author: Nicholas Clark <[email protected]> Date: Mon Jan 14 09:46:48 2013 +0100 Remove unreachable duplicate (?#...) parsing code from S_reg() I believe that this code was rendered unreachable when perl 5.001 added code to S_nextchar() to skip over embedded comments. Adrian Enache noted this in March 2003, and proposed a patch which removed it. See http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-03/msg00840.html The patch wasn't applied at that time, and when he sent it again August, he omitted that hunk. See http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-08/msg01820.html That version was applied as commit e994fd663a4d8acc. M regcomp.c ----------------------------------------------------------------------- Summary of changes: embed.fnc | 4 +- embed.h | 2 +- ext/Devel-Peek/t/Peek.t | 35 ++++++ proto.h | 4 +- regcomp.c | 299 ++++++++++++++++++++++++++++------------------- regcomp.h | 4 +- t/re/pat_advanced.t | 8 ++ t/re/re_tests | 4 + utf8.c | 5 +- 9 files changed, 234 insertions(+), 131 deletions(-) diff --git a/embed.fnc b/embed.fnc index b2b46f4..10c0111 100644 --- a/embed.fnc +++ b/embed.fnc @@ -1105,7 +1105,7 @@ Ap |REGEXP*|pregcomp |NN SV * const pattern|const U32 flags p |REGEXP*|re_op_compile |NULLOK SV ** const patternp \ |int pat_count|NULLOK OP *expr \ |NN const regexp_engine* eng \ - |NULLOK REGEXP *VOL old_re \ + |NULLOK REGEXP *old_re \ |NULLOK bool *is_bare_re \ |U32 rx_flags|U32 pm_flags Ap |REGEXP*|re_compile |NN SV * const pattern|U32 orig_rx_flags @@ -2033,7 +2033,7 @@ EsRn |U32 |add_data |NN struct RExC_state_t *pRExC_state|U32 n \ |NN const char *s rs |void |re_croak2 |NN const char* pat1|NN const char* pat2|... Ei |I32 |regpposixcc |NN struct RExC_state_t *pRExC_state \ - |I32 value|NULLOK SV *free_me|const bool strict + |I32 value|const bool strict Es |I32 |make_trie |NN struct RExC_state_t *pRExC_state \ |NN regnode *startbranch|NN regnode *first \ |NN regnode *last|NN regnode *tail \ diff --git a/embed.h b/embed.h index e9a1980..9a7c439 100644 --- a/embed.h +++ b/embed.h @@ -942,7 +942,7 @@ #define reginsert(a,b,c,d) S_reginsert(aTHX_ a,b,c,d) #define regpatws S_regpatws #define regpiece(a,b,c) S_regpiece(aTHX_ a,b,c) -#define regpposixcc(a,b,c,d) S_regpposixcc(aTHX_ a,b,c,d) +#define regpposixcc(a,b,c) S_regpposixcc(aTHX_ a,b,c) #define regtail(a,b,c,d) S_regtail(aTHX_ a,b,c,d) #define reguni(a,b,c) S_reguni(aTHX_ a,b,c) #define regwhite S_regwhite diff --git a/ext/Devel-Peek/t/Peek.t b/ext/Devel-Peek/t/Peek.t index b3dbc9b..a8b0d22 100644 --- a/ext/Devel-Peek/t/Peek.t +++ b/ext/Devel-Peek/t/Peek.t @@ -938,4 +938,39 @@ unless ($Config{useithreads}) { close OUT; } +do_test('UTF-8 in a regular expression', + qr/\x{100}/, +'SV = IV\($ADDR\) at $ADDR + REFCNT = 1 + FLAGS = \(ROK\) + RV = $ADDR + SV = REGEXP\($ADDR\) at $ADDR + REFCNT = 1 + FLAGS = \(OBJECT,FAKE,UTF8\) + PV = $ADDR "\(\?\^u:\\\\\\\\x\{100\}\)" \[UTF8 "\(\?\^u:\\\\\\\\x\{100\}\)"\] + CUR = 13 + STASH = $ADDR "Regexp" + EXTFLAGS = 0x680040 \(CHECK_ALL,USE_INTUIT_NOML,USE_INTUIT_ML\) + INTFLAGS = 0x0 + NPARENS = 0 + LASTPAREN = 0 + LASTCLOSEPAREN = 0 + MINLEN = 1 + MINLENRET = 1 + GOFS = 0 + PRE_PREFIX = 5 + SUBLEN = 0 + SUBOFFSET = 0 + SUBCOFFSET = 0 + SUBBEG = 0x0 + ENGINE = $ADDR + MOTHER_RE = $ADDR + PAREN_NAMES = 0x0 + SUBSTRS = $ADDR + PPRIVATE = $ADDR + OFFS = $ADDR + QR_ANONCV = 0x0 + SAVED_COPY = 0x0 +'); + done_testing(); diff --git a/proto.h b/proto.h index 67eea1d..c931bd9 100644 --- a/proto.h +++ b/proto.h @@ -3259,7 +3259,7 @@ PERL_CALLCONV SV* Perl_re_intuit_string(pTHX_ REGEXP *const r) #define PERL_ARGS_ASSERT_RE_INTUIT_STRING \ assert(r) -PERL_CALLCONV REGEXP* Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, OP *expr, const regexp_engine* eng, REGEXP *VOL old_re, bool *is_bare_re, U32 rx_flags, U32 pm_flags) +PERL_CALLCONV REGEXP* Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, OP *expr, const regexp_engine* eng, REGEXP *old_re, bool *is_bare_re, U32 rx_flags, U32 pm_flags) __attribute__nonnull__(pTHX_4); #define PERL_ARGS_ASSERT_RE_OP_COMPILE \ assert(eng) @@ -6700,7 +6700,7 @@ STATIC regnode* S_regpiece(pTHX_ struct RExC_state_t *pRExC_state, I32 *flagp, U #define PERL_ARGS_ASSERT_REGPIECE \ assert(pRExC_state); assert(flagp) -PERL_STATIC_INLINE I32 S_regpposixcc(pTHX_ struct RExC_state_t *pRExC_state, I32 value, SV *free_me, const bool strict) +PERL_STATIC_INLINE I32 S_regpposixcc(pTHX_ struct RExC_state_t *pRExC_state, I32 value, const bool strict) __attribute__nonnull__(pTHX_1); #define PERL_ARGS_ASSERT_REGPPOSIXCC \ assert(pRExC_state) diff --git a/regcomp.c b/regcomp.c index a16b8b3..10d7efc 100644 --- a/regcomp.c +++ b/regcomp.c @@ -231,8 +231,9 @@ typedef struct RExC_state_t { * REGNODE_SIMPLE */ #define SIMPLE 0x02 #define SPSTART 0x04 /* Starts with * or + */ -#define TRYAGAIN 0x08 /* Weeded out a declaration. */ -#define POSTPONED 0x10 /* (?1),(?&name), (??{...}) or similar */ +#define POSTPONED 0x08 /* (?1),(?&name), (??{...}) or similar */ +#define TRYAGAIN 0x10 /* Weeded out a declaration. */ +#define RESTART_UTF8 0x20 /* Restart, need to calcuate sizes as UTF-8 */ #define REG_NODE_NUM(x) ((x) ? (int)((x)-RExC_emit_start) : -1) @@ -251,10 +252,11 @@ typedef struct RExC_state_t { #define PAREN_SET(u8str,paren) PBYTE(u8str,paren) |= PBITVAL(paren) #define PAREN_UNSET(u8str,paren) PBYTE(u8str,paren) &= (~PBITVAL(paren)) -/* If not already in utf8, do a longjmp back to the beginning */ -#define UTF8_LONGJMP 42 /* Choose a value not likely to ever conflict */ #define REQUIRE_UTF8 STMT_START { \ - if (! UTF) JMPENV_JUMP(UTF8_LONGJMP); \ + if (!UTF) { \ + *flagp = RESTART_UTF8; \ + return NULL; \ + } \ } STMT_END /* This converts the named class defined in regcomp.h to its equivalent class @@ -5162,7 +5164,7 @@ S_setup_longest(pTHX_ RExC_state_t *pRExC_state, SV* sv_longest, SV** rx_utf8, S REGEXP * Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, - OP *expr, const regexp_engine* eng, REGEXP *VOL old_re, + OP *expr, const regexp_engine* eng, REGEXP *old_re, bool *is_bare_re, U32 orig_rx_flags, U32 pm_flags) { dVAR; @@ -5170,32 +5172,29 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, struct regexp *r; regexp_internal *ri; STRLEN plen; - char * VOL exp; + char *exp; char* xend; regnode *scan; I32 flags; I32 minlen = 0; U32 rx_flags; - SV * VOL pat; - SV * VOL code_blocksv = NULL; + SV *pat; + SV *code_blocksv = NULL; /* these are all flags - maybe they should be turned * into a single int with different bit masks */ I32 sawlookahead = 0; I32 sawplus = 0; I32 sawopen = 0; - bool used_setjump = FALSE; regex_charset initial_charset = get_regex_charset(orig_rx_flags); bool code_is_utf8 = 0; - bool VOL recompile = 0; + bool recompile = 0; bool runtime_code = 0; - U8 jump_ret = 0; - dJMPENV; scan_data_t data; RExC_state_t RExC_state; RExC_state_t * const pRExC_state = &RExC_state; #ifdef TRIE_STUDY_OPT - int restudied; + int restudied = 0; RExC_state_t copyRExC_state; #endif GET_RE_DEBUG_FLAGS_DECL; @@ -5513,6 +5512,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, } exp = SvPV_nomg(pat, plen); + xend = exp + plen; if (!eng->op_comp) { if ((SvUTF8(pat) && IN_BYTES) @@ -5533,39 +5533,22 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, RExC_contains_locale = 0; pRExC_state->runtime_code_qr = NULL; - /****************** LONG JUMP TARGET HERE***********************/ - /* Longjmp back to here if have to switch in midstream to utf8 */ - if (! RExC_orig_utf8) { - JMPENV_PUSH(jump_ret); - used_setjump = TRUE; - } - - if (jump_ret == 0) { /* First time through */ - xend = exp + plen; - - DEBUG_COMPILE_r({ + DEBUG_COMPILE_r({ SV *dsv= sv_newmortal(); - RE_PV_QUOTED_DECL(s, RExC_utf8, - dsv, exp, plen, 60); + RE_PV_QUOTED_DECL(s, RExC_utf8, dsv, exp, plen, 60); PerlIO_printf(Perl_debug_log, "%sCompiling REx%s %s\n", - PL_colors[4],PL_colors[5],s); + PL_colors[4],PL_colors[5],s); }); - } - else { /* longjumped back */ - U8 *src, *dst; + + if (0) { + redo_first_pass: + { + U8 *const src = (U8*)exp; + U8 *dst; int n=0; STRLEN s = 0, d = 0; bool do_end = 0; - /* If the cause for the longjmp was other than changing to utf8, pop - * our own setjmp, and longjmp to the correct handler */ - if (jump_ret != UTF8_LONGJMP) { - JMPENV_POP; - JMPENV_JUMP(jump_ret); - } - - GET_RE_DEBUG_FLAGS; - /* It's possible to write a regexp in ascii that represents Unicode codepoints outside of the byte range, such as via \x{100}. If we detect such a sequence we have to convert the entire pattern to utf8 @@ -5581,7 +5564,6 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, * recalculate the indices. * This is essentially an unrolled Perl_bytes_to_utf8() */ - src = (U8*)SvPV_nomg(pat, plen); Newx(dst, plen * 2 + 1, U8); while (s < plen) { @@ -5614,6 +5596,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, xend = exp + plen; SAVEFREEPV(exp); RExC_orig_utf8 = RExC_utf8 = 1; + } } /* return old regex if pattern hasn't changed */ @@ -5629,9 +5612,6 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, runtime_code = S_has_runtime_code(aTHX_ pRExC_state, expr, pm_flags, exp, plen); if (!runtime_code) { - if (used_setjump) { - JMPENV_POP; - } Safefree(pRExC_state->code_blocks); return old_re; } @@ -5645,10 +5625,6 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, runtime_code = S_has_runtime_code(aTHX_ pRExC_state, expr, pm_flags, exp, plen); -#ifdef TRIE_STUDY_OPT - restudied = 0; -#endif - rx_flags = orig_rx_flags; if (initial_charset == REGEX_LOCALE_CHARSET) { @@ -5672,7 +5648,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, if (!S_compile_runtime_code(aTHX_ pRExC_state, exp, plen)) { /* whoops, we have a non-utf8 pattern, whilst run-time code * got compiled as utf8. Try again with a utf8 pattern */ - JMPENV_JUMP(UTF8_LONGJMP); + goto redo_first_pass; } } assert(!pRExC_state->runtime_code_qr); @@ -5728,17 +5704,14 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, SvLEN_set(code_blocksv, 1); /*sufficient to make sv_clear free it*/ } if (reg(pRExC_state, 0, &flags,1) == NULL) { - RExC_precomp = NULL; - return(NULL); + if (flags & RESTART_UTF8) { + goto redo_first_pass; + } + Perl_croak(aTHX_ "panic: reg returned NULL to re_op_compile for sizing pass, flags=%#X", flags); } if (code_blocksv) SvLEN_set(code_blocksv,0); /* no you can't have it, sv_clear */ - /* Here, finished first pass. Get rid of any added setjmp */ - if (used_setjump) { - JMPENV_POP; - } - DEBUG_PARSE_r({ PerlIO_printf(Perl_debug_log, "Required size %"IVdf" nodes\n" @@ -5904,7 +5877,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, REGC((U8)REG_MAGIC, (char*) RExC_emit++); if (reg(pRExC_state, 0, &flags,1) == NULL) { ReREFCNT_dec(rx); - return(NULL); + Perl_croak(aTHX_ "panic: reg returned NULL to re_op_compile for generation pass, flags=%#X", flags); } /* XXXX To minimize changes to RE engine we always allocate 3-units-long substrs field. */ @@ -8321,6 +8294,11 @@ S__invlistEQ(pTHX_ SV* const a, SV* const b, bool complement_b) #define REGTAIL_STUDY(x,y,z) regtail((x),(y),(z),depth+1) #endif +/* Returns NULL, setting *flagp to TRYAGAIN at the end of (?) that only sets + flags. Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan + needs to be restarted. + Otherwise would only return NULL if regbranch() returns NULL, which + cannot happen. */ STATIC regnode * S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) /* paren: Parenthesized? 0=top, 1=(, inside: changed to letter. */ @@ -8630,14 +8608,6 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) case '@': /* (?@...) */ vFAIL2("Sequence (?%c...) not implemented", (int)paren); break; - case '#': /* (?#...) */ - while (*RExC_parse && *RExC_parse != ')') - RExC_parse++; - if (*RExC_parse != ')') - FAIL("Sequence (?#... not terminated"); - nextchar(pRExC_state); - *flagp = TRYAGAIN; - return NULL; case '0' : /* (?0) */ case 'R' : /* (?R) */ if (*RExC_parse != ')') @@ -8803,11 +8773,18 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) || RExC_parse[1] == '<' || RExC_parse[1] == '{') { /* Lookahead or eval. */ I32 flag; + regnode *tail; ret = reg_node(pRExC_state, LOGICAL); if (!SIZE_ONLY) ret->flags = 1; - REGTAIL(pRExC_state, ret, reg(pRExC_state, 1, &flag,depth+1)); + + tail = reg(pRExC_state, 1, &flag, depth+1); + if (flag & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + REGTAIL(pRExC_state, ret, tail); goto insert_if; } } @@ -8875,9 +8852,14 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) insert_if: REGTAIL(pRExC_state, ret, reganode(pRExC_state, IFTHEN, 0)); br = regbranch(pRExC_state, &flags, 1,depth+1); - if (br == NULL) - br = reganode(pRExC_state, LONGJMP, 0); - else + if (br == NULL) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regbranch returned NULL, flags=%#X", + flags); + } else REGTAIL(pRExC_state, br, reganode(pRExC_state, LONGJMP, 0)); c = *nextchar(pRExC_state); if (flags&HASWIDTH) @@ -8886,7 +8868,14 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) if (is_define) vFAIL("(?(DEFINE)....) does not allow branches"); lastbr = reganode(pRExC_state, IFTHEN, 0); /* Fake one for optimizer. */ - regbranch(pRExC_state, &flags, 1,depth+1); + if (!regbranch(pRExC_state, &flags, 1,depth+1)) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regbranch returned NULL, flags=%#X", + flags); + } REGTAIL(pRExC_state, ret, lastbr); if (flags&HASWIDTH) *flagp |= HASWIDTH; @@ -9135,8 +9124,13 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) /* branch_len = (paren != 0); */ - if (br == NULL) - return(NULL); + if (br == NULL) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regbranch returned NULL, flags=%#X", flags); + } if (*RExC_parse == '|') { if (!SIZE_ONLY && RExC_extralen) { reginsert(pRExC_state, BRANCHJ, br, depth+1); @@ -9175,8 +9169,13 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) } br = regbranch(pRExC_state, &flags, 0, depth+1); - if (br == NULL) - return(NULL); + if (br == NULL) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regbranch returned NULL, flags=%#X", flags); + } REGTAIL(pRExC_state, lastbr, br); /* BRANCH -> BRANCH. */ lastbr = br; *flagp |= flags & (SPSTART | HASWIDTH | POSTPONED); @@ -9333,6 +9332,9 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) - regbranch - one alternative of an | operator * * Implements the concatenation operator. + * + * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be + * restarted. */ STATIC regnode * S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth) @@ -9372,7 +9374,11 @@ S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth) if (latest == NULL) { if (flags & TRYAGAIN) continue; - return(NULL); + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regpiece returned NULL, flags=%#X", flags); } else if (ret == NULL) ret = latest; @@ -9406,6 +9412,11 @@ S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth) * both the endmarker for their branch list and the body of the last branch. * It might seem that this node could be dispensed with entirely, but the * endmarker role is not redundant. + * + * Returns NULL, setting *flagp to TRYAGAIN if regatom() returns NULL with + * TRYAGAIN. + * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be + * restarted. */ STATIC regnode * S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) @@ -9434,8 +9445,10 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) ret = regatom(pRExC_state, &flags,depth+1); if (ret == NULL) { - if (flags & TRYAGAIN) - *flagp |= TRYAGAIN; + if (flags & (TRYAGAIN|RESTART_UTF8)) + *flagp |= flags & (TRYAGAIN|RESTART_UTF8); + else + FAIL2("panic: regatom returned NULL, flags=%#X", flags); return(NULL); } @@ -9665,7 +9678,10 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state, regnode** node_p, UV *valuep, I The function raises an error (via vFAIL), and doesn't return for various syntax errors. Otherwise it returns TRUE and sets <node_p> or <valuep> on - success; it returns FALSE otherwise. + success; it returns FALSE otherwise. Returns FALSE, setting *flagp to + RESTART_UTF8 if the sizing scan needs to be restarted. Such a restart is + only possible if node_p is non-NULL. + If <valuep> is non-null, it means the caller can accept an input sequence consisting of a just a single code point; <*valuep> is set to that value @@ -9869,7 +9885,14 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state, regnode** node_p, UV *valuep, I /* The values are Unicode, and therefore not subject to recoding */ RExC_override_recoding = 1; - *node_p = reg(pRExC_state, 1, &flags, depth+1); + if (!(*node_p = reg(pRExC_state, 1, &flags, depth+1))) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return FALSE; + } + FAIL2("panic: reg returned NULL to grok_bslash_N, flags=%#X", + flags); + } *flagp |= flags&(HASWIDTH|SPSTART|SIMPLE|POSTPONED); RExC_parse = endbrace; @@ -10069,6 +10092,12 @@ S_alloc_maybe_populate_EXACT(pTHX_ RExC_state_t *pRExC_state, regnode *node, I32 escape sequences, with the one for handling literal escapes requiring a dummy entry for all of the special escapes that are actually handled by the other. + + Returns NULL, setting *flagp to TRYAGAIN if reg() returns NULL with + TRYAGAIN. + Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be + restarted. + Otherwise does not return NULL. */ STATIC regnode * @@ -10136,6 +10165,12 @@ tryagain: RExC_parse = oregcomp_parse; vFAIL("Unmatched ["); } + if (ret == NULL) { + if (*flagp & RESTART_UTF8) + return NULL; + FAIL2("panic: regclass returned NULL to regatom, flags=%#X", + *flagp); + } nextchar(pRExC_state); Set_Node_Length(ret, RExC_parse - oregcomp_parse + 1); /* MJD */ break; @@ -10152,7 +10187,11 @@ tryagain: } goto tryagain; } - return(NULL); + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: reg returned NULL to regatom, flags=%#X", flags); } *flagp |= flags&(HASWIDTH|SPSTART|SIMPLE|POSTPONED); break; @@ -10344,6 +10383,11 @@ tryagain: It would be a bug if these returned non-portables */ NULL); + /* regclass() can only return RESTART_UTF8 if multi-char folds + are allowed. */ + if (!ret) + FAIL2("panic: regclass returned NULL to regatom, flags=%#X", + *flagp); RExC_parse--; @@ -10366,6 +10410,8 @@ tryagain: ++RExC_parse; if (! grok_bslash_N(pRExC_state, &ret, NULL, flagp, depth, FALSE, FALSE /* not strict */ )) { + if (*flagp & RESTART_UTF8) + return NULL; RExC_parse--; goto defchar; } @@ -10632,6 +10678,8 @@ tryagain: flagp, depth, FALSE, FALSE /* not strict */ )) { + if (*flagp & RESTART_UTF8) + FAIL("panic: grok_bslash_N set RESTART_UTF8"); RExC_parse = p = oldp; goto loopdone; } @@ -11180,8 +11228,7 @@ S_regpatws( RExC_state_t *pRExC_state, char *p , const bool recognize_comment ) #define POSIXCC(c) (POSIXCC_DONE(c) || POSIXCC_NOTYET(c)) PERL_STATIC_INLINE I32 -S_regpposixcc(pTHX_ RExC_state_t *pRExC_state, I32 value, SV *free_me, - const bool strict) +S_regpposixcc(pTHX_ RExC_state_t *pRExC_state, I32 value, const bool strict) { dVAR; I32 namedclass = OOB_NAMEDCLASS; @@ -11305,7 +11352,6 @@ S_regpposixcc(pTHX_ RExC_state_t *pRExC_state, I32 value, SV *free_me, the class closes */ while (UCHARAT(RExC_parse) && UCHARAT(RExC_parse) != ']') RExC_parse++; - SvREFCNT_dec(free_me); vFAIL3("POSIX syntax [%c %c] is reserved for future extensions", c, c); } } else { @@ -11441,13 +11487,18 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, RExC_parse++; } - (void) regclass(pRExC_state, flagp,depth+1, - is_posix_class, /* parse the whole char - class only if not a - posix class */ - FALSE, /* don't allow multi-char folds */ - TRUE, /* silence non-portable warnings. */ - ¤t); + /* regclass() can only return RESTART_UTF8 if multi-char + folds are allowed. */ + if (!regclass(pRExC_state, flagp,depth+1, + is_posix_class, /* parse the whole char + class only if not a + posix class */ + FALSE, /* don't allow multi-char folds */ + TRUE, /* silence non-portable warnings. */ + ¤t)) + FAIL2("panic: regclass returned NULL to handle_sets, flags=%#X", + *flagp); + /* function call leaves parse pointing to the ']', except * if we faked it */ if (is_posix_class) { @@ -11537,12 +11588,15 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, vFAIL("Unexpected character"); case '\\': - (void) regclass(pRExC_state, flagp,depth+1, - TRUE, /* means parse just the next thing */ - FALSE, /* don't allow multi-char folds */ - FALSE, /* don't silence non-portable warnings. - */ - ¤t); + /* regclass() can only return RESTART_UTF8 if multi-char + folds are allowed. */ + if (!regclass(pRExC_state, flagp,depth+1, + TRUE, /* means parse just the next thing */ + FALSE, /* don't allow multi-char folds */ + FALSE, /* don't silence non-portable warnings. */ + ¤t)) + FAIL2("panic: regclass returned NULL to handle_sets, flags=%#X", + *flagp); /* regclass() will return with parsing just the \ sequence, * leaving the parse pointer at the next thing to parse */ RExC_parse--; @@ -11556,13 +11610,16 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, RExC_parse++; } - (void) regclass(pRExC_state, flagp,depth+1, - is_posix_class, /* parse the whole char class - only if not a posix class */ - FALSE, /* don't allow multi-char folds */ - FALSE, /* don't silence non-portable warnings. - */ - ¤t); + /* regclass() can only return RESTART_UTF8 if multi-char + folds are allowed. */ + if(!regclass(pRExC_state, flagp,depth+1, + is_posix_class, /* parse the whole char class + only if not a posix class */ + FALSE, /* don't allow multi-char folds */ + FALSE, /* don't silence non-portable warnings. */ + ¤t)) + FAIL2("panic: regclass returned NULL to handle_sets, flags=%#X", + *flagp); /* function call leaves parse pointing to the ']', except if we * faked it */ if (is_posix_class) { @@ -11737,6 +11794,8 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, * already has all folding taken into consideration, and we don't want * regclass() to add to that */ RExC_flags &= ~RXf_PMf_FOLD; + /* regclass() can only return RESTART_UTF8 if multi-char folds are allowed. + */ node = regclass(pRExC_state, flagp,depth+1, FALSE, /* means parse the whole char class */ FALSE, /* don't allow multi-char folds */ @@ -11744,6 +11803,8 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, well have generated non-portable code points, but they're valid on this machine */ NULL); + if (!node) + FAIL2("panic: regclass returned NULL to handle_sets, flags=%#X", flagp); if (save_fold) { RExC_flags |= RXf_PMf_FOLD; } @@ -11793,7 +11854,11 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, * corresponding bit set if that character is in the list. For characters * above 255, a range list or swash is used. There are extra bits for \w, * etc. in locale ANYOFs, as what these match is not determinable at - * compile time */ + * compile time + * + * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs + * to be restarted. This can only happen if ret_invlist is non-NULL. + */ dVAR; UV prevvalue = OOB_UNICODE, save_prevvalue = OOB_UNICODE; @@ -11879,8 +11944,9 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, if (LOC) { ANYOF_FLAGS(ret) |= ANYOF_LOCALE; } - listsv = newSVpvs("# comment\n"); + listsv = newSVpvs_flags("# comment\n", SVs_TEMP); initial_listsv_len = SvCUR(listsv); + SvTEMP_off(listsv); /* Grr, TEMPs and mortals are conflated. */ } if (skip_white) { @@ -11908,12 +11974,10 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, s++; if (*s && c == *s && s[1] == ']') { SAVEFREESV(RExC_rx_sv); - SAVEFREESV(listsv); ckWARN3reg(s+2, "POSIX syntax [%c %c] belongs inside character classes", c, c); (void)ReREFCNT_inc(RExC_rx_sv); - SvREFCNT_inc_simple_void_NN(listsv); } } @@ -11965,7 +12029,7 @@ parseit: && RExC_parse < RExC_end && POSIXCC(UCHARAT(RExC_parse))) { - namedclass = regpposixcc(pRExC_state, value, listsv, strict); + namedclass = regpposixcc(pRExC_state, value, strict); } else if (value == '\\') { if (UTF) { @@ -12010,6 +12074,8 @@ parseit: TRUE, /* => charclass */ strict)) { + if (*flagp & RESTART_UTF8) + FAIL("panic: grok_bslash_N set RESTART_UTF8"); goto parseit; } } @@ -12203,7 +12269,6 @@ parseit: value = grok_oct(--RExC_parse, &numlen, &flags, NULL); RExC_parse += numlen; if (numlen != 3) { - SAVEFREESV(listsv); /* In case warnings are fatalized */ if (strict) { RExC_parse += (UTF) ? UTF8SKIP(RExC_parse) : 1; vFAIL("Need exactly 3 octal digits"); @@ -12220,7 +12285,6 @@ parseit: form_short_octal_warning(RExC_parse, numlen)); (void)ReREFCNT_inc(RExC_rx_sv); } - SvREFCNT_inc_simple_void_NN(listsv); } if (PL_encoding && value < 0x100) goto recode_encoding; @@ -12244,7 +12308,6 @@ parseit: default: /* Allow \_ to not give an error */ if (!SIZE_ONLY && isWORDCHAR(value) && value != '_') { - SAVEFREESV(listsv); if (strict) { vFAIL2("Unrecognized escape \\%c in character class", (int)value); @@ -12256,7 +12319,6 @@ parseit: (int)value); (void)ReREFCNT_inc(RExC_rx_sv); } - SvREFCNT_inc_simple_void_NN(listsv); } break; } /* End of switch on char following backslash */ @@ -12302,7 +12364,6 @@ parseit: const int w = (RExC_parse >= rangebegin) ? RExC_parse - rangebegin : 0; - SAVEFREESV(listsv); /* in case of fatal warnings */ if (strict) { vFAIL4("False [] range \"%*.*s\"", w, w, rangebegin); } @@ -12315,7 +12376,6 @@ parseit: cp_list = add_cp_to_invlist(cp_list, '-'); cp_list = add_cp_to_invlist(cp_list, prevvalue); } - SvREFCNT_inc_simple_void_NN(listsv); } range = 0; /* this was not a true range */ @@ -12851,13 +12911,12 @@ parseit: ret = reg(pRExC_state, 1, ®_flags, depth+1); - *flagp |= reg_flags&(HASWIDTH|SIMPLE|SPSTART|POSTPONED); + *flagp |= reg_flags&(HASWIDTH|SIMPLE|SPSTART|POSTPONED|RESTART_UTF8); RExC_parse = save_parse; RExC_end = save_end; RExC_in_multi_char_class = 0; SvREFCNT_dec_NN(multi_char_matches); - SvREFCNT_dec_NN(listsv); return ret; } @@ -13015,7 +13074,6 @@ parseit: RExC_parse = (char *) cur_parse; SvREFCNT_dec(posixes); - SvREFCNT_dec_NN(listsv); SvREFCNT_dec(cp_list); return ret; } @@ -13383,7 +13441,7 @@ parseit: else { RExC_emit = orig_emit; } - return END; + return orig_emit; } /* If we didn't do folding, it's because some information isn't available @@ -13517,7 +13575,6 @@ parseit: } SvREFCNT_dec_NN(cp_list); - SvREFCNT_dec_NN(listsv); return ret; } } @@ -13605,7 +13662,6 @@ parseit: && ! HAS_NONLOCALE_RUNTIME_PROPERTY_DEFINITION) { ARG_SET(ret, ANYOF_NONBITMAP_EMPTY); - SvREFCNT_dec_NN(listsv); } else { /* av[0] stores the character class description in its textual form: @@ -13622,8 +13678,7 @@ parseit: SV *rv; av_store(av, 0, (HAS_NONLOCALE_RUNTIME_PROPERTY_DEFINITION) - ? listsv - : (SvREFCNT_dec_NN(listsv), &PL_sv_undef)); + ? SvREFCNT_inc(listsv) : &PL_sv_undef); if (swash) { av_store(av, 1, swash); SvREFCNT_dec_NN(cp_list); diff --git a/regcomp.h b/regcomp.h index 9607326..21d2e1e 100644 --- a/regcomp.h +++ b/regcomp.h @@ -829,11 +829,9 @@ re.pm, especially to the documentation. if (re_debug_flags & RE_DEBUG_EXTRA_GPOS) x ) /* initialization */ -/* get_sv() can return NULL during global destruction. re_debug_flags can get - * clobbered by a longjmp, so must be initialized */ +/* get_sv() can return NULL during global destruction. */ #define GET_RE_DEBUG_FLAGS DEBUG_r({ \ SV * re_debug_flags_sv = NULL; \ - re_debug_flags = 0; \ re_debug_flags_sv = get_sv(RE_DEBUG_FLAGS, 1); \ if (re_debug_flags_sv) { \ if (!SvIOK(re_debug_flags_sv)) \ diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t index 29a64dd..cd067f4 100644 --- a/t/re/pat_advanced.t +++ b/t/re/pat_advanced.t @@ -1082,6 +1082,14 @@ sub run_tests { eval "q(W) =~ /\\N{$name}/"; ok ! $w, 'Verify that latin1 letter in name doesnt give warning'; + # This tests the code path that restarts the parse when the recursive + # call to S_reg() from within S_grok_bslash_N() discovers that the + # pattern needs to be recalculated as UTF-8. use eval to avoid + # needing literal Unicode in this source file: + my $r = eval "qr/\\N{\x{100}\x{100}}/"; + isnt $r, undef, "Generated regex for multi-char UTF-8 charname" + or diag($@); + ok "\x{100}\x{100}" =~ $r, "which matches"; } { diff --git a/t/re/re_tests b/t/re/re_tests index c41d529..7e7fc85 100644 --- a/t/re/re_tests +++ b/t/re/re_tests @@ -592,6 +592,10 @@ x(~~)*(?:(?:F)?)? x~~ y - - (?(?!a)b|a) a y $& a (?(?=a)b|a) a n - - (?(?=a)a|b) a y $& a +(?(?!\x{100})\x{100}|b) \x{100} n - - +(?(?!\x{100})b|\x{100}) \x{100} y $& \x{100} +(?(?=\x{100})b|\x{100}) \x{100} n - - +(?(?=\x{100})\x{100}|b) \x{100} y $& \x{100} (?=(a+?))(\1ab) aaab y $2 aab ^(?=(a+?))\1ab aaab n - - (\w+:)+ one: y $1 one: diff --git a/utf8.c b/utf8.c index 1bf3f52..ec3a0ba 100644 --- a/utf8.c +++ b/utf8.c @@ -4423,9 +4423,12 @@ The pointer to the PV of the C<dsv> is returned. char * Perl_sv_uni_display(pTHX_ SV *dsv, SV *ssv, STRLEN pvlim, UV flags) { + const char * const ptr = + isREGEXP(ssv) ? RX_WRAPPED((REGEXP*)ssv) : SvPVX_const(ssv); + PERL_ARGS_ASSERT_SV_UNI_DISPLAY; - return Perl_pv_uni_display(aTHX_ dsv, (const U8*)SvPVX_const(ssv), + return Perl_pv_uni_display(aTHX_ dsv, (const U8*)ptr, SvCUR(ssv), pvlim, flags); } -- Perl5 Master Repository
