In perl.git, the branch smoke-me/remove-regcomp-setjmp has been updated <http://perl5.git.perl.org/perl.git/commitdiff/470c26f2012b3f4f6540ae6ee7a595b049974699?hp=36baafc9b54ecc5cdea82552276a70b5218958bb>
- Log ----------------------------------------------------------------- commit 470c26f2012b3f4f6540ae6ee7a595b049974699 Author: Nicholas Clark <n...@ccl4.org> Date: Sat Jan 19 11:06:10 2013 +0100 Move the longjmp() that implements REQUIRE_UTF8 up to Perl_re_op_compile(). With longjmp() and setjmp() now in the same function (and all tests passing), it becomes easy to replace the pair with a goto. Still evil, but "the lesser of two evils". M regcomp.c commit 9e18d992386fc2004b8565aa28bfca69ef0b0532 Author: Nicholas Clark <n...@ccl4.org> Date: Fri Jan 18 17:21:03 2013 +0100 Add a flag RESTART_UTF8 to the reg*() routines in regcomp.c Add a flag RESTART_UTF8 along with infrastructure to the reg*() routines to permit the parse to be restarted without using longjmp(). However, it's not used yet. M regcomp.c commit 9896e62575838e26122c6cf2e600303e136e0fbf Author: Nicholas Clark <n...@ccl4.org> Date: Fri Jan 18 11:32:44 2013 +0100 In S_regclass(), create listsv as a mortal, claiming a reference if needed. The SV listsv is sometimes stored in an array generated near the end of S_regclass(). In other cases it is not used, and it needs to be freed if any of the warnings that S_regclass() can trigger turn out to be fatal. The simplest solution to this problem is to declare it from the start as a mortal, and claim a (new) reference to it if it is *not* to be freed. This permits the removal of all other code related to ensuring that it is freed at the right time, but not freed prematurely if a call to a warning returns. M embed.fnc M embed.h M proto.h M regcomp.c commit 7ed73f7b0ef42242d4b31c1b414ae4d872cb8f05 Author: Nicholas Clark <n...@ccl4.org> Date: Thu Jan 17 11:47:13 2013 +0100 Document when and why S_reg{,branch,piece,atom,class}() return NULL. As documented in pod/perlreguts.pod, the call graph for regex parsing involves several levels of functions in regcomp.c, sometimes recursing more than once. The top level compiling function, S_reg(), calls S_regbranch() to parse each single branch of an alternation. In turn, that calls S_regpiece() to parse a simple pattern followed by quantifier, which calls S_regatom() to parse that simple pattern. S_regatom() can call S_regclass() to handle classes, but can also recurse into S_reg() to handle subpatterns and some other constructions. Some other routines call call S_reg(), sometimes using an alternative pattern that they generate dynamically to represent their input. These routines all return a pointer to a regnode structure, and take a pointer to an integer that holds flags, which is also used to return information. Historically, it has not been clear when and why they return NULL, and whether the return value can be ignored. In particular, "Jumbo regexp patch" (commit c277df42229d99fe, from Nov 1997), added code with two calls from S_reg() to S_regbranch(), one of which checks the return value and generates a LONGJMP node if it returns NULL, the other of which is called in void context, and so both ignores any return value, or the possibility that it is NULL. After some analysis I have untangled the possible return values from these 5 functions (and related functions which call S_reg()). Starting from the top: S_reg() will return NULL and set the flags to TRYAGAIN at the end of pragma- line constructions that it handles. Otherwise, historically it would return NULL if S_regbranch() returned NULL. In turn, S_regbranch() would return NULL if S_regpiece() returned NULL without setting TRYAGAIN. If S_regpiece() returns TRYAGAIN, S_regbranch() loops, and ultimately will not return NULL. S_regpiece() returns NULL with TRYAGAIN if S_regatom() returns NULL with TRYAGAIN, but (historically) if S_regatom() returns NULL without setting the flags to TRYAGAIN, S_regpiece() would to. Where S_regatom() calls S_reg() it has similar behaviour when passing back return values, although often it is able to loop instead on getting a TRYAGAIN. Which gets us back to S_reg(), which can only *generate* NULL in conjunction with TRYAGAIN. NULL without TRYAGAIN could only be returned if a routine it called generated it. All other functions that these call that return regnode structures cannot return NULL. Hence 1) in the loop of functions called, there is no source for a return value of NULL without the TRYAGAIN flag being set 2) a return value of NULL with TRYAGAIN set from an inner function does not propagate out past S_regbranch() Hence the only return values that most functions can generate are non-NULL, or NULL with TRYAGAIN set, and as S_regbranch() catches these, it cannot return NULL. The longest sequence of functions that can return NULL (with TRYAGAIN set) is S_reg() -> S_regatom() -> S_regpiece() -> S_regbranch(). Rapidly returning right round the loop back to S_reg() is not possible. Hence code added by commit c277df42229d99fe to handle a NULL return from S_regbranch(), along with some other code is dead. I have replaced all unreachable code with FAIL()s that panic. M regcomp.c commit 22bd5ba57cf1bc90e91482b7bc64453e0a9516d7 Author: Nicholas Clark <n...@ccl4.org> Date: Fri Jan 18 16:30:39 2013 +0100 Return orig_emit from S_regclass() when ret_invlist is true. The return value isn't used (yet). Previously the code was returning END, which is a macro for the regnode number for "End of program" which happens to be 0. It happens that 0 is valid C for a NULL pointer, but using it in this way makes the intent unclear. Not only is orig_emit a valid type, it's semantically the correct thing to return, as it's most recently added node. M regcomp.c commit 36d78074bb875bbca35cab1faa9dcf5ab9b33755 Author: Nicholas Clark <n...@ccl4.org> Date: Wed Jan 16 21:58:02 2013 +0100 Test that UTF-8 in the look-ahead of (?(?=...)...) restarts the sizing parse. S_reg() recurses to itself to parse various constructions used as the conditionals in conditional matching. Look-aheads and look-behinds can turn out to need to be sized as UTF-8, which can cause the inner S_reg() to use the macro REQUIRE_UTF8 is used to restart the parse. Test that this is handled correctly. M t/re/re_tests commit 6bfd9d40bc3b041083654c5119f467f9a6316424 Author: Nicholas Clark <n...@ccl4.org> Date: Wed Jan 16 17:08:03 2013 +0100 Test that S_grok_bslash_N() copes if S_reg() restarts the sizing parse. S_reg() can discover midway through parsing the pattern to determine its size, that the pattern will actually need to be encoded as UTF-8. If calculations so far have been done in terms of bytes, then the macro REQUIRE_UTF8 is used to restart the parse, so that sizes can be calculated correctly for UTF-8. It is possible to trigger this restart when processing multi-character charnames interpolated into the pattern using \N{}. Test that this is handled correctly. M t/re/pat_advanced.t commit 1fcb0bf489074391f4e8e00d2ffb0318a0808d76 Author: Nicholas Clark <n...@ccl4.org> Date: Wed Jan 16 11:48:04 2013 +0100 Perl_sv_uni_display() needs to be aware of RX_WRAPPED() Commit 8d919b0a35f2b57a changed the storage location of the string in SVt_REGEXP. It updated most code to deal with this, but missed the use of SvPVX_const() in Perl_sv_uni_display(). This breaks dumping regular expressions which have the UTF-8 flag set. M ext/Devel-Peek/t/Peek.t M utf8.c commit 85ee183d01471b9d65e3dccd122ccbeca9ddd713 Author: Nicholas Clark <n...@ccl4.org> Date: Mon Jan 14 09:46:48 2013 +0100 Remove unreachable duplicate (?#...) parsing code from S_reg() I believe that this code was rendered unreachable when perl 5.001 added code to S_nextchar() to skip over embedded comments. Adrian Enache noted this in March 2003, and proposed a patch which removed it. See http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-03/msg00840.html The patch wasn't applied at that time, and when he sent it again August, he omitted that hunk. See http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-08/msg01820.html That version was applied as commit e994fd663a4d8acc. M regcomp.c ----------------------------------------------------------------------- Summary of changes: embed.fnc | 2 +- embed.h | 2 +- ext/Devel-Peek/t/Peek.t | 35 +++++++ proto.h | 2 +- regcomp.c | 229 ++++++++++++++++++++++++++++++++--------------- t/re/pat_advanced.t | 8 ++ t/re/re_tests | 4 + utf8.c | 5 +- 8 files changed, 212 insertions(+), 75 deletions(-) diff --git a/embed.fnc b/embed.fnc index b2b46f4..029221e 100644 --- a/embed.fnc +++ b/embed.fnc @@ -2033,7 +2033,7 @@ EsRn |U32 |add_data |NN struct RExC_state_t *pRExC_state|U32 n \ |NN const char *s rs |void |re_croak2 |NN const char* pat1|NN const char* pat2|... Ei |I32 |regpposixcc |NN struct RExC_state_t *pRExC_state \ - |I32 value|NULLOK SV *free_me|const bool strict + |I32 value|const bool strict Es |I32 |make_trie |NN struct RExC_state_t *pRExC_state \ |NN regnode *startbranch|NN regnode *first \ |NN regnode *last|NN regnode *tail \ diff --git a/embed.h b/embed.h index e9a1980..9a7c439 100644 --- a/embed.h +++ b/embed.h @@ -942,7 +942,7 @@ #define reginsert(a,b,c,d) S_reginsert(aTHX_ a,b,c,d) #define regpatws S_regpatws #define regpiece(a,b,c) S_regpiece(aTHX_ a,b,c) -#define regpposixcc(a,b,c,d) S_regpposixcc(aTHX_ a,b,c,d) +#define regpposixcc(a,b,c) S_regpposixcc(aTHX_ a,b,c) #define regtail(a,b,c,d) S_regtail(aTHX_ a,b,c,d) #define reguni(a,b,c) S_reguni(aTHX_ a,b,c) #define regwhite S_regwhite diff --git a/ext/Devel-Peek/t/Peek.t b/ext/Devel-Peek/t/Peek.t index b3dbc9b..a8b0d22 100644 --- a/ext/Devel-Peek/t/Peek.t +++ b/ext/Devel-Peek/t/Peek.t @@ -938,4 +938,39 @@ unless ($Config{useithreads}) { close OUT; } +do_test('UTF-8 in a regular expression', + qr/\x{100}/, +'SV = IV\($ADDR\) at $ADDR + REFCNT = 1 + FLAGS = \(ROK\) + RV = $ADDR + SV = REGEXP\($ADDR\) at $ADDR + REFCNT = 1 + FLAGS = \(OBJECT,FAKE,UTF8\) + PV = $ADDR "\(\?\^u:\\\\\\\\x\{100\}\)" \[UTF8 "\(\?\^u:\\\\\\\\x\{100\}\)"\] + CUR = 13 + STASH = $ADDR "Regexp" + EXTFLAGS = 0x680040 \(CHECK_ALL,USE_INTUIT_NOML,USE_INTUIT_ML\) + INTFLAGS = 0x0 + NPARENS = 0 + LASTPAREN = 0 + LASTCLOSEPAREN = 0 + MINLEN = 1 + MINLENRET = 1 + GOFS = 0 + PRE_PREFIX = 5 + SUBLEN = 0 + SUBOFFSET = 0 + SUBCOFFSET = 0 + SUBBEG = 0x0 + ENGINE = $ADDR + MOTHER_RE = $ADDR + PAREN_NAMES = 0x0 + SUBSTRS = $ADDR + PPRIVATE = $ADDR + OFFS = $ADDR + QR_ANONCV = 0x0 + SAVED_COPY = 0x0 +'); + done_testing(); diff --git a/proto.h b/proto.h index 67eea1d..f17b96e 100644 --- a/proto.h +++ b/proto.h @@ -6700,7 +6700,7 @@ STATIC regnode* S_regpiece(pTHX_ struct RExC_state_t *pRExC_state, I32 *flagp, U #define PERL_ARGS_ASSERT_REGPIECE \ assert(pRExC_state); assert(flagp) -PERL_STATIC_INLINE I32 S_regpposixcc(pTHX_ struct RExC_state_t *pRExC_state, I32 value, SV *free_me, const bool strict) +PERL_STATIC_INLINE I32 S_regpposixcc(pTHX_ struct RExC_state_t *pRExC_state, I32 value, const bool strict) __attribute__nonnull__(pTHX_1); #define PERL_ARGS_ASSERT_REGPPOSIXCC \ assert(pRExC_state) diff --git a/regcomp.c b/regcomp.c index a16b8b3..e1902ed 100644 --- a/regcomp.c +++ b/regcomp.c @@ -231,8 +231,9 @@ typedef struct RExC_state_t { * REGNODE_SIMPLE */ #define SIMPLE 0x02 #define SPSTART 0x04 /* Starts with * or + */ -#define TRYAGAIN 0x08 /* Weeded out a declaration. */ -#define POSTPONED 0x10 /* (?1),(?&name), (??{...}) or similar */ +#define POSTPONED 0x08 /* (?1),(?&name), (??{...}) or similar */ +#define TRYAGAIN 0x10 /* Weeded out a declaration. */ +#define RESTART_UTF8 0x20 /* Restart, need to calcuate sizes as UTF-8 */ #define REG_NODE_NUM(x) ((x) ? (int)((x)-RExC_emit_start) : -1) @@ -254,7 +255,10 @@ typedef struct RExC_state_t { /* If not already in utf8, do a longjmp back to the beginning */ #define UTF8_LONGJMP 42 /* Choose a value not likely to ever conflict */ #define REQUIRE_UTF8 STMT_START { \ - if (! UTF) JMPENV_JUMP(UTF8_LONGJMP); \ + if (!UTF) { \ + *flagp = RESTART_UTF8; \ + return NULL; \ + } \ } STMT_END /* This converts the named class defined in regcomp.h to its equivalent class @@ -5728,8 +5732,9 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, SvLEN_set(code_blocksv, 1); /*sufficient to make sv_clear free it*/ } if (reg(pRExC_state, 0, &flags,1) == NULL) { - RExC_precomp = NULL; - return(NULL); + if (flags & RESTART_UTF8) + JMPENV_JUMP(UTF8_LONGJMP); + Perl_croak(aTHX_ "panic: reg returned NULL to re_op_compile for sizing pass, flags=%#X", flags); } if (code_blocksv) SvLEN_set(code_blocksv,0); /* no you can't have it, sv_clear */ @@ -5904,7 +5909,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count, REGC((U8)REG_MAGIC, (char*) RExC_emit++); if (reg(pRExC_state, 0, &flags,1) == NULL) { ReREFCNT_dec(rx); - return(NULL); + Perl_croak(aTHX_ "panic: reg returned NULL to re_op_compile for generation pass, flags=%#X", flags); } /* XXXX To minimize changes to RE engine we always allocate 3-units-long substrs field. */ @@ -8321,6 +8326,11 @@ S__invlistEQ(pTHX_ SV* const a, SV* const b, bool complement_b) #define REGTAIL_STUDY(x,y,z) regtail((x),(y),(z),depth+1) #endif +/* Returns NULL, setting *flagp to TRYAGAIN at the end of (?) that only sets + flags. Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan + needs to be restarted. + Otherwise would only return NULL if regbranch() returns NULL, which + cannot happen. */ STATIC regnode * S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) /* paren: Parenthesized? 0=top, 1=(, inside: changed to letter. */ @@ -8630,14 +8640,6 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) case '@': /* (?@...) */ vFAIL2("Sequence (?%c...) not implemented", (int)paren); break; - case '#': /* (?#...) */ - while (*RExC_parse && *RExC_parse != ')') - RExC_parse++; - if (*RExC_parse != ')') - FAIL("Sequence (?#... not terminated"); - nextchar(pRExC_state); - *flagp = TRYAGAIN; - return NULL; case '0' : /* (?0) */ case 'R' : /* (?R) */ if (*RExC_parse != ')') @@ -8803,11 +8805,18 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) || RExC_parse[1] == '<' || RExC_parse[1] == '{') { /* Lookahead or eval. */ I32 flag; + regnode *tail; ret = reg_node(pRExC_state, LOGICAL); if (!SIZE_ONLY) ret->flags = 1; - REGTAIL(pRExC_state, ret, reg(pRExC_state, 1, &flag,depth+1)); + + tail = reg(pRExC_state, 1, &flag, depth+1); + if (flag & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + REGTAIL(pRExC_state, ret, tail); goto insert_if; } } @@ -8875,9 +8884,14 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) insert_if: REGTAIL(pRExC_state, ret, reganode(pRExC_state, IFTHEN, 0)); br = regbranch(pRExC_state, &flags, 1,depth+1); - if (br == NULL) - br = reganode(pRExC_state, LONGJMP, 0); - else + if (br == NULL) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regbranch returned NULL, flags=%#X", + flags); + } else REGTAIL(pRExC_state, br, reganode(pRExC_state, LONGJMP, 0)); c = *nextchar(pRExC_state); if (flags&HASWIDTH) @@ -8886,7 +8900,14 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) if (is_define) vFAIL("(?(DEFINE)....) does not allow branches"); lastbr = reganode(pRExC_state, IFTHEN, 0); /* Fake one for optimizer. */ - regbranch(pRExC_state, &flags, 1,depth+1); + if (!regbranch(pRExC_state, &flags, 1,depth+1)) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regbranch returned NULL, flags=%#X", + flags); + } REGTAIL(pRExC_state, ret, lastbr); if (flags&HASWIDTH) *flagp |= HASWIDTH; @@ -9135,8 +9156,13 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) /* branch_len = (paren != 0); */ - if (br == NULL) - return(NULL); + if (br == NULL) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regbranch returned NULL, flags=%#X", flags); + } if (*RExC_parse == '|') { if (!SIZE_ONLY && RExC_extralen) { reginsert(pRExC_state, BRANCHJ, br, depth+1); @@ -9175,8 +9201,13 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) } br = regbranch(pRExC_state, &flags, 0, depth+1); - if (br == NULL) - return(NULL); + if (br == NULL) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regbranch returned NULL, flags=%#X", flags); + } REGTAIL(pRExC_state, lastbr, br); /* BRANCH -> BRANCH. */ lastbr = br; *flagp |= flags & (SPSTART | HASWIDTH | POSTPONED); @@ -9333,6 +9364,9 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) - regbranch - one alternative of an | operator * * Implements the concatenation operator. + * + * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be + * restarted. */ STATIC regnode * S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth) @@ -9372,7 +9406,11 @@ S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth) if (latest == NULL) { if (flags & TRYAGAIN) continue; - return(NULL); + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: regpiece returned NULL, flags=%#X", flags); } else if (ret == NULL) ret = latest; @@ -9406,6 +9444,11 @@ S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth) * both the endmarker for their branch list and the body of the last branch. * It might seem that this node could be dispensed with entirely, but the * endmarker role is not redundant. + * + * Returns NULL, setting *flagp to TRYAGAIN if regatom() returns NULL with + * TRYAGAIN. + * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be + * restarted. */ STATIC regnode * S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) @@ -9434,8 +9477,10 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) ret = regatom(pRExC_state, &flags,depth+1); if (ret == NULL) { - if (flags & TRYAGAIN) - *flagp |= TRYAGAIN; + if (flags & (TRYAGAIN|RESTART_UTF8)) + *flagp |= flags & (TRYAGAIN|RESTART_UTF8); + else + FAIL2("panic: regatom returned NULL, flags=%#X", flags); return(NULL); } @@ -9665,7 +9710,10 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state, regnode** node_p, UV *valuep, I The function raises an error (via vFAIL), and doesn't return for various syntax errors. Otherwise it returns TRUE and sets <node_p> or <valuep> on - success; it returns FALSE otherwise. + success; it returns FALSE otherwise. Returns FALSE, setting *flagp to + RESTART_UTF8 if the sizing scan needs to be restarted. Such a restart is + only possible if node_p is non-NULL. + If <valuep> is non-null, it means the caller can accept an input sequence consisting of a just a single code point; <*valuep> is set to that value @@ -9869,7 +9917,14 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state, regnode** node_p, UV *valuep, I /* The values are Unicode, and therefore not subject to recoding */ RExC_override_recoding = 1; - *node_p = reg(pRExC_state, 1, &flags, depth+1); + if (!(*node_p = reg(pRExC_state, 1, &flags, depth+1))) { + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return FALSE; + } + FAIL2("panic: reg returned NULL to grok_bslash_N, flags=%#X", + flags); + } *flagp |= flags&(HASWIDTH|SPSTART|SIMPLE|POSTPONED); RExC_parse = endbrace; @@ -10069,6 +10124,12 @@ S_alloc_maybe_populate_EXACT(pTHX_ RExC_state_t *pRExC_state, regnode *node, I32 escape sequences, with the one for handling literal escapes requiring a dummy entry for all of the special escapes that are actually handled by the other. + + Returns NULL, setting *flagp to TRYAGAIN if reg() returns NULL with + TRYAGAIN. + Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be + restarted. + Otherwise does not return NULL. */ STATIC regnode * @@ -10136,6 +10197,12 @@ tryagain: RExC_parse = oregcomp_parse; vFAIL("Unmatched ["); } + if (ret == NULL) { + if (*flagp & RESTART_UTF8) + return NULL; + FAIL2("panic: regclass returned NULL to regatom, flags=%#X", + *flagp); + } nextchar(pRExC_state); Set_Node_Length(ret, RExC_parse - oregcomp_parse + 1); /* MJD */ break; @@ -10152,7 +10219,11 @@ tryagain: } goto tryagain; } - return(NULL); + if (flags & RESTART_UTF8) { + *flagp = RESTART_UTF8; + return NULL; + } + FAIL2("panic: reg returned NULL to regatom, flags=%#X", flags); } *flagp |= flags&(HASWIDTH|SPSTART|SIMPLE|POSTPONED); break; @@ -10344,6 +10415,11 @@ tryagain: It would be a bug if these returned non-portables */ NULL); + /* regclass() can only return RESTART_UTF8 if multi-char folds + are allowed. */ + if (!ret) + FAIL2("panic: regclass returned NULL to regatom, flags=%#X", + *flagp); RExC_parse--; @@ -10366,6 +10442,8 @@ tryagain: ++RExC_parse; if (! grok_bslash_N(pRExC_state, &ret, NULL, flagp, depth, FALSE, FALSE /* not strict */ )) { + if (*flagp & RESTART_UTF8) + return NULL; RExC_parse--; goto defchar; } @@ -10632,6 +10710,8 @@ tryagain: flagp, depth, FALSE, FALSE /* not strict */ )) { + if (*flagp & RESTART_UTF8) + FAIL("panic: grok_bslash_N set RESTART_UTF8"); RExC_parse = p = oldp; goto loopdone; } @@ -11180,8 +11260,7 @@ S_regpatws( RExC_state_t *pRExC_state, char *p , const bool recognize_comment ) #define POSIXCC(c) (POSIXCC_DONE(c) || POSIXCC_NOTYET(c)) PERL_STATIC_INLINE I32 -S_regpposixcc(pTHX_ RExC_state_t *pRExC_state, I32 value, SV *free_me, - const bool strict) +S_regpposixcc(pTHX_ RExC_state_t *pRExC_state, I32 value, const bool strict) { dVAR; I32 namedclass = OOB_NAMEDCLASS; @@ -11305,7 +11384,6 @@ S_regpposixcc(pTHX_ RExC_state_t *pRExC_state, I32 value, SV *free_me, the class closes */ while (UCHARAT(RExC_parse) && UCHARAT(RExC_parse) != ']') RExC_parse++; - SvREFCNT_dec(free_me); vFAIL3("POSIX syntax [%c %c] is reserved for future extensions", c, c); } } else { @@ -11441,13 +11519,18 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, RExC_parse++; } - (void) regclass(pRExC_state, flagp,depth+1, - is_posix_class, /* parse the whole char - class only if not a - posix class */ - FALSE, /* don't allow multi-char folds */ - TRUE, /* silence non-portable warnings. */ - ¤t); + /* regclass() can only return RESTART_UTF8 if multi-char + folds are allowed. */ + if (!regclass(pRExC_state, flagp,depth+1, + is_posix_class, /* parse the whole char + class only if not a + posix class */ + FALSE, /* don't allow multi-char folds */ + TRUE, /* silence non-portable warnings. */ + ¤t)) + FAIL2("panic: regclass returned NULL to handle_sets, flags=%#X", + *flagp); + /* function call leaves parse pointing to the ']', except * if we faked it */ if (is_posix_class) { @@ -11537,12 +11620,15 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, vFAIL("Unexpected character"); case '\\': - (void) regclass(pRExC_state, flagp,depth+1, - TRUE, /* means parse just the next thing */ - FALSE, /* don't allow multi-char folds */ - FALSE, /* don't silence non-portable warnings. - */ - ¤t); + /* regclass() can only return RESTART_UTF8 if multi-char + folds are allowed. */ + if (!regclass(pRExC_state, flagp,depth+1, + TRUE, /* means parse just the next thing */ + FALSE, /* don't allow multi-char folds */ + FALSE, /* don't silence non-portable warnings. */ + ¤t)) + FAIL2("panic: regclass returned NULL to handle_sets, flags=%#X", + *flagp); /* regclass() will return with parsing just the \ sequence, * leaving the parse pointer at the next thing to parse */ RExC_parse--; @@ -11556,13 +11642,16 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, RExC_parse++; } - (void) regclass(pRExC_state, flagp,depth+1, - is_posix_class, /* parse the whole char class - only if not a posix class */ - FALSE, /* don't allow multi-char folds */ - FALSE, /* don't silence non-portable warnings. - */ - ¤t); + /* regclass() can only return RESTART_UTF8 if multi-char + folds are allowed. */ + if(!regclass(pRExC_state, flagp,depth+1, + is_posix_class, /* parse the whole char class + only if not a posix class */ + FALSE, /* don't allow multi-char folds */ + FALSE, /* don't silence non-portable warnings. */ + ¤t)) + FAIL2("panic: regclass returned NULL to handle_sets, flags=%#X", + *flagp); /* function call leaves parse pointing to the ']', except if we * faked it */ if (is_posix_class) { @@ -11737,6 +11826,8 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, * already has all folding taken into consideration, and we don't want * regclass() to add to that */ RExC_flags &= ~RXf_PMf_FOLD; + /* regclass() can only return RESTART_UTF8 if multi-char folds are allowed. + */ node = regclass(pRExC_state, flagp,depth+1, FALSE, /* means parse the whole char class */ FALSE, /* don't allow multi-char folds */ @@ -11744,6 +11835,8 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, well have generated non-portable code points, but they're valid on this machine */ NULL); + if (!node) + FAIL2("panic: regclass returned NULL to handle_sets, flags=%#X", flagp); if (save_fold) { RExC_flags |= RXf_PMf_FOLD; } @@ -11793,7 +11886,11 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, * corresponding bit set if that character is in the list. For characters * above 255, a range list or swash is used. There are extra bits for \w, * etc. in locale ANYOFs, as what these match is not determinable at - * compile time */ + * compile time + * + * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs + * to be restarted. This can only happen if ret_invlist is non-NULL. + */ dVAR; UV prevvalue = OOB_UNICODE, save_prevvalue = OOB_UNICODE; @@ -11879,8 +11976,9 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, if (LOC) { ANYOF_FLAGS(ret) |= ANYOF_LOCALE; } - listsv = newSVpvs("# comment\n"); + listsv = newSVpvs_flags("# comment\n", SVs_TEMP); initial_listsv_len = SvCUR(listsv); + SvTEMP_off(listsv); /* Grr, TEMPs and mortals are conflated. */ } if (skip_white) { @@ -11908,12 +12006,10 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, s++; if (*s && c == *s && s[1] == ']') { SAVEFREESV(RExC_rx_sv); - SAVEFREESV(listsv); ckWARN3reg(s+2, "POSIX syntax [%c %c] belongs inside character classes", c, c); (void)ReREFCNT_inc(RExC_rx_sv); - SvREFCNT_inc_simple_void_NN(listsv); } } @@ -11965,7 +12061,7 @@ parseit: && RExC_parse < RExC_end && POSIXCC(UCHARAT(RExC_parse))) { - namedclass = regpposixcc(pRExC_state, value, listsv, strict); + namedclass = regpposixcc(pRExC_state, value, strict); } else if (value == '\\') { if (UTF) { @@ -12010,6 +12106,8 @@ parseit: TRUE, /* => charclass */ strict)) { + if (*flagp & RESTART_UTF8) + FAIL("panic: grok_bslash_N set RESTART_UTF8"); goto parseit; } } @@ -12203,7 +12301,6 @@ parseit: value = grok_oct(--RExC_parse, &numlen, &flags, NULL); RExC_parse += numlen; if (numlen != 3) { - SAVEFREESV(listsv); /* In case warnings are fatalized */ if (strict) { RExC_parse += (UTF) ? UTF8SKIP(RExC_parse) : 1; vFAIL("Need exactly 3 octal digits"); @@ -12220,7 +12317,6 @@ parseit: form_short_octal_warning(RExC_parse, numlen)); (void)ReREFCNT_inc(RExC_rx_sv); } - SvREFCNT_inc_simple_void_NN(listsv); } if (PL_encoding && value < 0x100) goto recode_encoding; @@ -12244,7 +12340,6 @@ parseit: default: /* Allow \_ to not give an error */ if (!SIZE_ONLY && isWORDCHAR(value) && value != '_') { - SAVEFREESV(listsv); if (strict) { vFAIL2("Unrecognized escape \\%c in character class", (int)value); @@ -12256,7 +12351,6 @@ parseit: (int)value); (void)ReREFCNT_inc(RExC_rx_sv); } - SvREFCNT_inc_simple_void_NN(listsv); } break; } /* End of switch on char following backslash */ @@ -12302,7 +12396,6 @@ parseit: const int w = (RExC_parse >= rangebegin) ? RExC_parse - rangebegin : 0; - SAVEFREESV(listsv); /* in case of fatal warnings */ if (strict) { vFAIL4("False [] range \"%*.*s\"", w, w, rangebegin); } @@ -12315,7 +12408,6 @@ parseit: cp_list = add_cp_to_invlist(cp_list, '-'); cp_list = add_cp_to_invlist(cp_list, prevvalue); } - SvREFCNT_inc_simple_void_NN(listsv); } range = 0; /* this was not a true range */ @@ -12851,13 +12943,12 @@ parseit: ret = reg(pRExC_state, 1, ®_flags, depth+1); - *flagp |= reg_flags&(HASWIDTH|SIMPLE|SPSTART|POSTPONED); + *flagp |= reg_flags&(HASWIDTH|SIMPLE|SPSTART|POSTPONED|RESTART_UTF8); RExC_parse = save_parse; RExC_end = save_end; RExC_in_multi_char_class = 0; SvREFCNT_dec_NN(multi_char_matches); - SvREFCNT_dec_NN(listsv); return ret; } @@ -13015,7 +13106,6 @@ parseit: RExC_parse = (char *) cur_parse; SvREFCNT_dec(posixes); - SvREFCNT_dec_NN(listsv); SvREFCNT_dec(cp_list); return ret; } @@ -13383,7 +13473,7 @@ parseit: else { RExC_emit = orig_emit; } - return END; + return orig_emit; } /* If we didn't do folding, it's because some information isn't available @@ -13517,7 +13607,6 @@ parseit: } SvREFCNT_dec_NN(cp_list); - SvREFCNT_dec_NN(listsv); return ret; } } @@ -13605,7 +13694,6 @@ parseit: && ! HAS_NONLOCALE_RUNTIME_PROPERTY_DEFINITION) { ARG_SET(ret, ANYOF_NONBITMAP_EMPTY); - SvREFCNT_dec_NN(listsv); } else { /* av[0] stores the character class description in its textual form: @@ -13622,8 +13710,7 @@ parseit: SV *rv; av_store(av, 0, (HAS_NONLOCALE_RUNTIME_PROPERTY_DEFINITION) - ? listsv - : (SvREFCNT_dec_NN(listsv), &PL_sv_undef)); + ? SvREFCNT_inc(listsv) : &PL_sv_undef); if (swash) { av_store(av, 1, swash); SvREFCNT_dec_NN(cp_list); diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t index 29a64dd..cd067f4 100644 --- a/t/re/pat_advanced.t +++ b/t/re/pat_advanced.t @@ -1082,6 +1082,14 @@ sub run_tests { eval "q(W) =~ /\\N{$name}/"; ok ! $w, 'Verify that latin1 letter in name doesnt give warning'; + # This tests the code path that restarts the parse when the recursive + # call to S_reg() from within S_grok_bslash_N() discovers that the + # pattern needs to be recalculated as UTF-8. use eval to avoid + # needing literal Unicode in this source file: + my $r = eval "qr/\\N{\x{100}\x{100}}/"; + isnt $r, undef, "Generated regex for multi-char UTF-8 charname" + or diag($@); + ok "\x{100}\x{100}" =~ $r, "which matches"; } { diff --git a/t/re/re_tests b/t/re/re_tests index c41d529..7e7fc85 100644 --- a/t/re/re_tests +++ b/t/re/re_tests @@ -592,6 +592,10 @@ x(~~)*(?:(?:F)?)? x~~ y - - (?(?!a)b|a) a y $& a (?(?=a)b|a) a n - - (?(?=a)a|b) a y $& a +(?(?!\x{100})\x{100}|b) \x{100} n - - +(?(?!\x{100})b|\x{100}) \x{100} y $& \x{100} +(?(?=\x{100})b|\x{100}) \x{100} n - - +(?(?=\x{100})\x{100}|b) \x{100} y $& \x{100} (?=(a+?))(\1ab) aaab y $2 aab ^(?=(a+?))\1ab aaab n - - (\w+:)+ one: y $1 one: diff --git a/utf8.c b/utf8.c index 1bf3f52..ec3a0ba 100644 --- a/utf8.c +++ b/utf8.c @@ -4423,9 +4423,12 @@ The pointer to the PV of the C<dsv> is returned. char * Perl_sv_uni_display(pTHX_ SV *dsv, SV *ssv, STRLEN pvlim, UV flags) { + const char * const ptr = + isREGEXP(ssv) ? RX_WRAPPED((REGEXP*)ssv) : SvPVX_const(ssv); + PERL_ARGS_ASSERT_SV_UNI_DISPLAY; - return Perl_pv_uni_display(aTHX_ dsv, (const U8*)SvPVX_const(ssv), + return Perl_pv_uni_display(aTHX_ dsv, (const U8*)ptr, SvCUR(ssv), pvlim, flags); } -- Perl5 Master Repository