In perl.git, the branch smoke-me/remove-regcomp-setjmp has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/470c26f2012b3f4f6540ae6ee7a595b049974699?hp=36baafc9b54ecc5cdea82552276a70b5218958bb>

- Log -----------------------------------------------------------------
commit 470c26f2012b3f4f6540ae6ee7a595b049974699
Author: Nicholas Clark <n...@ccl4.org>
Date:   Sat Jan 19 11:06:10 2013 +0100

    Move the longjmp() that implements REQUIRE_UTF8 up to Perl_re_op_compile().
    
    With longjmp() and setjmp() now in the same function (and all tests 
passing),
    it becomes easy to replace the pair with a goto. Still evil, but "the lesser
    of two evils".

M       regcomp.c

commit 9e18d992386fc2004b8565aa28bfca69ef0b0532
Author: Nicholas Clark <n...@ccl4.org>
Date:   Fri Jan 18 17:21:03 2013 +0100

    Add a flag RESTART_UTF8 to the reg*() routines in regcomp.c
    
    Add a flag RESTART_UTF8 along with infrastructure to the reg*() routines to
    permit the parse to be restarted without using longjmp(). However, it's not
    used yet.

M       regcomp.c

commit 9896e62575838e26122c6cf2e600303e136e0fbf
Author: Nicholas Clark <n...@ccl4.org>
Date:   Fri Jan 18 11:32:44 2013 +0100

    In S_regclass(), create listsv as a mortal, claiming a reference if needed.
    
    The SV listsv is sometimes stored in an array generated near the end of
    S_regclass(). In other cases it is not used, and it needs to be freed if
    any of the warnings that S_regclass() can trigger turn out to be fatal.
    
    The simplest solution to this problem is to declare it from the start as a
    mortal, and claim a (new) reference to it if it is *not* to be freed. This
    permits the removal of all other code related to ensuring that it is freed
    at the right time, but not freed prematurely if a call to a warning returns.

M       embed.fnc
M       embed.h
M       proto.h
M       regcomp.c

commit 7ed73f7b0ef42242d4b31c1b414ae4d872cb8f05
Author: Nicholas Clark <n...@ccl4.org>
Date:   Thu Jan 17 11:47:13 2013 +0100

    Document when and why S_reg{,branch,piece,atom,class}() return NULL.
    
    As documented in pod/perlreguts.pod, the call graph for regex parsing
    involves several levels of functions in regcomp.c, sometimes recursing more
    than once.
    
    The top level compiling function, S_reg(), calls S_regbranch() to parse each
    single branch of an alternation. In turn, that calls S_regpiece() to parse
    a simple pattern followed by quantifier, which calls S_regatom() to parse
    that simple pattern. S_regatom() can call S_regclass() to handle classes,
    but can also recurse into S_reg() to handle subpatterns and some other
    constructions. Some other routines call call S_reg(), sometimes using an
    alternative pattern that they generate dynamically to represent their input.
    
    These routines all return a pointer to a regnode structure, and take a
    pointer to an integer that holds flags, which is also used to return
    information.
    
    Historically, it has not been clear when and why they return NULL, and
    whether the return value can be ignored. In particular, "Jumbo regexp patch"
    (commit c277df42229d99fe, from Nov 1997), added code with two calls from
    S_reg() to S_regbranch(), one of which checks the return value and generates
    a LONGJMP node if it returns NULL, the other of which is called in void
    context, and so both ignores any return value, or the possibility that it is
    NULL.
    
    After some analysis I have untangled the possible return values from these
    5 functions (and related functions which call S_reg()).
    
    Starting from the top:
    S_reg() will return NULL and set the flags to TRYAGAIN at the end of pragma-
    line constructions that it handles. Otherwise, historically it would return
    NULL if S_regbranch() returned NULL. In turn, S_regbranch() would return
    NULL if S_regpiece() returned NULL without setting TRYAGAIN. If S_regpiece()
    returns TRYAGAIN, S_regbranch() loops, and ultimately will not return NULL.
    
    S_regpiece() returns NULL with TRYAGAIN if S_regatom() returns NULL with
    TRYAGAIN, but (historically) if S_regatom() returns NULL without setting
    the flags to TRYAGAIN, S_regpiece() would to. Where S_regatom() calls
    S_reg() it has similar behaviour when passing back return values, although
    often it is able to loop instead on getting a TRYAGAIN.
    
    Which gets us back to S_reg(), which can only *generate* NULL in conjunction
    with TRYAGAIN. NULL without TRYAGAIN could only be returned if a routine it
    called generated it. All other functions that these call that return regnode
    structures cannot return NULL. Hence
    
    1) in the loop of functions called, there is no source for a return value of
       NULL without the TRYAGAIN flag being set
    2) a return value of NULL with TRYAGAIN set from an inner function does not
       propagate out past S_regbranch()
    
    Hence the only return values that most functions can generate are non-NULL,
    or NULL with TRYAGAIN set, and as S_regbranch() catches these, it cannot
    return NULL. The longest sequence of functions that can return NULL (with
    TRYAGAIN set) is S_reg() -> S_regatom() -> S_regpiece() -> S_regbranch().
    Rapidly returning right round the loop back to S_reg() is not possible.
    
    Hence code added by commit c277df42229d99fe to handle a NULL return from
    S_regbranch(), along with some other code is dead.
    
    I have replaced all unreachable code with FAIL()s that panic.

M       regcomp.c

commit 22bd5ba57cf1bc90e91482b7bc64453e0a9516d7
Author: Nicholas Clark <n...@ccl4.org>
Date:   Fri Jan 18 16:30:39 2013 +0100

    Return orig_emit from S_regclass() when ret_invlist is true.
    
    The return value isn't used (yet). Previously the code was returning END,
    which is a macro for the regnode number for "End of program" which happens 
to
    be 0. It happens that 0 is valid C for a NULL pointer, but using it in this
    way makes the intent unclear. Not only is orig_emit a valid type, it's
    semantically the correct thing to return, as it's most recently added node.

M       regcomp.c

commit 36d78074bb875bbca35cab1faa9dcf5ab9b33755
Author: Nicholas Clark <n...@ccl4.org>
Date:   Wed Jan 16 21:58:02 2013 +0100

    Test that UTF-8 in the look-ahead of (?(?=...)...) restarts the sizing 
parse.
    
    S_reg() recurses to itself to parse various constructions used as the
    conditionals in conditional matching. Look-aheads and look-behinds can turn
    out to need to be sized as UTF-8, which can cause the inner S_reg() to use
    the macro REQUIRE_UTF8 is used to restart the parse. Test that this is
    handled correctly.

M       t/re/re_tests

commit 6bfd9d40bc3b041083654c5119f467f9a6316424
Author: Nicholas Clark <n...@ccl4.org>
Date:   Wed Jan 16 17:08:03 2013 +0100

    Test that S_grok_bslash_N() copes if S_reg() restarts the sizing parse.
    
    S_reg() can discover midway through parsing the pattern to determine its
    size, that the pattern will actually need to be encoded as UTF-8. If
    calculations so far have been done in terms of bytes, then the macro
    REQUIRE_UTF8 is used to restart the parse, so that sizes can be calculated
    correctly for UTF-8.
    
    It is possible to trigger this restart when processing multi-character
    charnames interpolated into the pattern using \N{}. Test that this is
    handled correctly.

M       t/re/pat_advanced.t

commit 1fcb0bf489074391f4e8e00d2ffb0318a0808d76
Author: Nicholas Clark <n...@ccl4.org>
Date:   Wed Jan 16 11:48:04 2013 +0100

    Perl_sv_uni_display() needs to be aware of RX_WRAPPED()
    
    Commit 8d919b0a35f2b57a changed the storage location of the string in
    SVt_REGEXP. It updated most code to deal with this, but missed the use of
    SvPVX_const() in Perl_sv_uni_display(). This breaks dumping regular
    expressions which have the UTF-8 flag set.

M       ext/Devel-Peek/t/Peek.t
M       utf8.c

commit 85ee183d01471b9d65e3dccd122ccbeca9ddd713
Author: Nicholas Clark <n...@ccl4.org>
Date:   Mon Jan 14 09:46:48 2013 +0100

    Remove unreachable duplicate (?#...) parsing code from S_reg()
    
    I believe that this code was rendered unreachable when perl 5.001 added
    code to S_nextchar() to skip over embedded comments. Adrian Enache noted
    this in March 2003, and proposed a patch which removed it. See
    http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-03/msg00840.html
    
    The patch wasn't applied at that time, and when he sent it again August,
    he omitted that hunk. See
    http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-08/msg01820.html
    
    That version was applied as commit e994fd663a4d8acc.

M       regcomp.c
-----------------------------------------------------------------------

Summary of changes:
 embed.fnc               |    2 +-
 embed.h                 |    2 +-
 ext/Devel-Peek/t/Peek.t |   35 +++++++
 proto.h                 |    2 +-
 regcomp.c               |  229 ++++++++++++++++++++++++++++++++---------------
 t/re/pat_advanced.t     |    8 ++
 t/re/re_tests           |    4 +
 utf8.c                  |    5 +-
 8 files changed, 212 insertions(+), 75 deletions(-)

diff --git a/embed.fnc b/embed.fnc
index b2b46f4..029221e 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -2033,7 +2033,7 @@ EsRn      |U32    |add_data       |NN struct RExC_state_t 
*pRExC_state|U32 n \
                                |NN const char *s
 rs     |void   |re_croak2      |NN const char* pat1|NN const char* pat2|...
 Ei     |I32    |regpposixcc    |NN struct RExC_state_t *pRExC_state \
-                               |I32 value|NULLOK SV *free_me|const bool strict
+                               |I32 value|const bool strict
 Es     |I32    |make_trie      |NN struct RExC_state_t *pRExC_state \
                                |NN regnode *startbranch|NN regnode *first \
                                |NN regnode *last|NN regnode *tail \
diff --git a/embed.h b/embed.h
index e9a1980..9a7c439 100644
--- a/embed.h
+++ b/embed.h
@@ -942,7 +942,7 @@
 #define reginsert(a,b,c,d)     S_reginsert(aTHX_ a,b,c,d)
 #define regpatws               S_regpatws
 #define regpiece(a,b,c)                S_regpiece(aTHX_ a,b,c)
-#define regpposixcc(a,b,c,d)   S_regpposixcc(aTHX_ a,b,c,d)
+#define regpposixcc(a,b,c)     S_regpposixcc(aTHX_ a,b,c)
 #define regtail(a,b,c,d)       S_regtail(aTHX_ a,b,c,d)
 #define reguni(a,b,c)          S_reguni(aTHX_ a,b,c)
 #define regwhite               S_regwhite
diff --git a/ext/Devel-Peek/t/Peek.t b/ext/Devel-Peek/t/Peek.t
index b3dbc9b..a8b0d22 100644
--- a/ext/Devel-Peek/t/Peek.t
+++ b/ext/Devel-Peek/t/Peek.t
@@ -938,4 +938,39 @@ unless ($Config{useithreads}) {
     close OUT;
 }
 
+do_test('UTF-8 in a regular expression',
+        qr/\x{100}/,
+'SV = IV\($ADDR\) at $ADDR
+  REFCNT = 1
+  FLAGS = \(ROK\)
+  RV = $ADDR
+  SV = REGEXP\($ADDR\) at $ADDR
+    REFCNT = 1
+    FLAGS = \(OBJECT,FAKE,UTF8\)
+    PV = $ADDR "\(\?\^u:\\\\\\\\x\{100\}\)" \[UTF8 
"\(\?\^u:\\\\\\\\x\{100\}\)"\]
+    CUR = 13
+    STASH = $ADDR      "Regexp"
+    EXTFLAGS = 0x680040 \(CHECK_ALL,USE_INTUIT_NOML,USE_INTUIT_ML\)
+    INTFLAGS = 0x0
+    NPARENS = 0
+    LASTPAREN = 0
+    LASTCLOSEPAREN = 0
+    MINLEN = 1
+    MINLENRET = 1
+    GOFS = 0
+    PRE_PREFIX = 5
+    SUBLEN = 0
+    SUBOFFSET = 0
+    SUBCOFFSET = 0
+    SUBBEG = 0x0
+    ENGINE = $ADDR
+    MOTHER_RE = $ADDR
+    PAREN_NAMES = 0x0
+    SUBSTRS = $ADDR
+    PPRIVATE = $ADDR
+    OFFS = $ADDR
+    QR_ANONCV = 0x0
+    SAVED_COPY = 0x0
+');
+
 done_testing();
diff --git a/proto.h b/proto.h
index 67eea1d..f17b96e 100644
--- a/proto.h
+++ b/proto.h
@@ -6700,7 +6700,7 @@ STATIC regnode*   S_regpiece(pTHX_ struct RExC_state_t 
*pRExC_state, I32 *flagp, U
 #define PERL_ARGS_ASSERT_REGPIECE      \
        assert(pRExC_state); assert(flagp)
 
-PERL_STATIC_INLINE I32 S_regpposixcc(pTHX_ struct RExC_state_t *pRExC_state, 
I32 value, SV *free_me, const bool strict)
+PERL_STATIC_INLINE I32 S_regpposixcc(pTHX_ struct RExC_state_t *pRExC_state, 
I32 value, const bool strict)
                        __attribute__nonnull__(pTHX_1);
 #define PERL_ARGS_ASSERT_REGPPOSIXCC   \
        assert(pRExC_state)
diff --git a/regcomp.c b/regcomp.c
index a16b8b3..e1902ed 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -231,8 +231,9 @@ typedef struct RExC_state_t {
  * REGNODE_SIMPLE */
 #define        SIMPLE          0x02
 #define        SPSTART         0x04    /* Starts with * or + */
-#define TRYAGAIN       0x08    /* Weeded out a declaration. */
-#define POSTPONED      0x10    /* (?1),(?&name), (??{...}) or similar */
+#define POSTPONED      0x08    /* (?1),(?&name), (??{...}) or similar */
+#define TRYAGAIN       0x10    /* Weeded out a declaration. */
+#define RESTART_UTF8    0x20    /* Restart, need to calcuate sizes as UTF-8 */
 
 #define REG_NODE_NUM(x) ((x) ? (int)((x)-RExC_emit_start) : -1)
 
@@ -254,7 +255,10 @@ typedef struct RExC_state_t {
 /* If not already in utf8, do a longjmp back to the beginning */
 #define UTF8_LONGJMP 42 /* Choose a value not likely to ever conflict */
 #define REQUIRE_UTF8   STMT_START {                                       \
-                                     if (! UTF) JMPENV_JUMP(UTF8_LONGJMP); \
+                                     if (!UTF) {                           \
+                                         *flagp = RESTART_UTF8;            \
+                                         return NULL;                      \
+                                     }                                     \
                         } STMT_END
 
 /* This converts the named class defined in regcomp.h to its equivalent class
@@ -5728,8 +5732,9 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int 
pat_count,
        SvLEN_set(code_blocksv, 1); /*sufficient to make sv_clear free it*/
     }
     if (reg(pRExC_state, 0, &flags,1) == NULL) {
-       RExC_precomp = NULL;
-       return(NULL);
+        if (flags & RESTART_UTF8)
+            JMPENV_JUMP(UTF8_LONGJMP);
+        Perl_croak(aTHX_ "panic: reg returned NULL to re_op_compile for sizing 
pass, flags=%#X", flags);
     }
     if (code_blocksv)
        SvLEN_set(code_blocksv,0); /* no you can't have it, sv_clear */
@@ -5904,7 +5909,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int 
pat_count,
     REGC((U8)REG_MAGIC, (char*) RExC_emit++);
     if (reg(pRExC_state, 0, &flags,1) == NULL) {
        ReREFCNT_dec(rx);   
-       return(NULL);
+        Perl_croak(aTHX_ "panic: reg returned NULL to re_op_compile for 
generation pass, flags=%#X", flags);
     }
     /* XXXX To minimize changes to RE engine we always allocate
        3-units-long substrs field. */
@@ -8321,6 +8326,11 @@ S__invlistEQ(pTHX_ SV* const a, SV* const b, bool 
complement_b)
 #define REGTAIL_STUDY(x,y,z) regtail((x),(y),(z),depth+1)
 #endif
 
+/* Returns NULL, setting *flagp to TRYAGAIN at the end of (?) that only sets
+   flags. Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan
+   needs to be restarted.
+   Otherwise would only return NULL if regbranch() returns NULL, which
+   cannot happen.  */
 STATIC regnode *
 S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
     /* paren: Parenthesized? 0=top, 1=(, inside: changed to letter. */
@@ -8630,14 +8640,6 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
            case '@':           /* (?@...) */
                vFAIL2("Sequence (?%c...) not implemented", (int)paren);
                break;
-           case '#':           /* (?#...) */
-               while (*RExC_parse && *RExC_parse != ')')
-                   RExC_parse++;
-               if (*RExC_parse != ')')
-                   FAIL("Sequence (?#... not terminated");
-               nextchar(pRExC_state);
-               *flagp = TRYAGAIN;
-               return NULL;
            case '0' :           /* (?0) */
            case 'R' :           /* (?R) */
                if (*RExC_parse != ')')
@@ -8803,11 +8805,18 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
                        || RExC_parse[1] == '<'
                        || RExC_parse[1] == '{') { /* Lookahead or eval. */
                        I32 flag;
+                        regnode *tail;
 
                        ret = reg_node(pRExC_state, LOGICAL);
                        if (!SIZE_ONLY)
                            ret->flags = 1;
-                        REGTAIL(pRExC_state, ret, reg(pRExC_state, 1, 
&flag,depth+1));
+                        
+                        tail = reg(pRExC_state, 1, &flag, depth+1);
+                        if (flag & RESTART_UTF8) {
+                            *flagp = RESTART_UTF8;
+                            return NULL;
+                        }
+                        REGTAIL(pRExC_state, ret, tail);
                        goto insert_if;
                    }
                }
@@ -8875,9 +8884,14 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
                  insert_if:
                     REGTAIL(pRExC_state, ret, reganode(pRExC_state, IFTHEN, 
0));
                     br = regbranch(pRExC_state, &flags, 1,depth+1);
-                   if (br == NULL)
-                       br = reganode(pRExC_state, LONGJMP, 0);
-                   else
+                   if (br == NULL) {
+                        if (flags & RESTART_UTF8) {
+                            *flagp = RESTART_UTF8;
+                            return NULL;
+                        }
+                        FAIL2("panic: regbranch returned NULL, flags=%#X",
+                              flags);
+                    } else
                         REGTAIL(pRExC_state, br, reganode(pRExC_state, 
LONGJMP, 0));
                    c = *nextchar(pRExC_state);
                    if (flags&HASWIDTH)
@@ -8886,7 +8900,14 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
                        if (is_define) 
                            vFAIL("(?(DEFINE)....) does not allow branches");
                        lastbr = reganode(pRExC_state, IFTHEN, 0); /* Fake one 
for optimizer. */
-                        regbranch(pRExC_state, &flags, 1,depth+1);
+                        if (!regbranch(pRExC_state, &flags, 1,depth+1)) {
+                            if (flags & RESTART_UTF8) {
+                                *flagp = RESTART_UTF8;
+                                return NULL;
+                            }
+                            FAIL2("panic: regbranch returned NULL, flags=%#X",
+                                  flags);
+                        }
                         REGTAIL(pRExC_state, ret, lastbr);
                        if (flags&HASWIDTH)
                            *flagp |= HASWIDTH;
@@ -9135,8 +9156,13 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
 
     /*     branch_len = (paren != 0); */
 
-    if (br == NULL)
-       return(NULL);
+    if (br == NULL) {
+        if (flags & RESTART_UTF8) {
+            *flagp = RESTART_UTF8;
+            return NULL;
+        }
+        FAIL2("panic: regbranch returned NULL, flags=%#X", flags);
+    }
     if (*RExC_parse == '|') {
        if (!SIZE_ONLY && RExC_extralen) {
            reginsert(pRExC_state, BRANCHJ, br, depth+1);
@@ -9175,8 +9201,13 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
         }
         br = regbranch(pRExC_state, &flags, 0, depth+1);
 
-       if (br == NULL)
-           return(NULL);
+       if (br == NULL) {
+            if (flags & RESTART_UTF8) {
+                *flagp = RESTART_UTF8;
+                return NULL;
+            }
+            FAIL2("panic: regbranch returned NULL, flags=%#X", flags);
+        }
         REGTAIL(pRExC_state, lastbr, br);               /* BRANCH -> BRANCH. */
        lastbr = br;
        *flagp |= flags & (SPSTART | HASWIDTH | POSTPONED);
@@ -9333,6 +9364,9 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
  - regbranch - one alternative of an | operator
  *
  * Implements the concatenation operator.
+ *
+ * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be
+ * restarted.
  */
 STATIC regnode *
 S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth)
@@ -9372,7 +9406,11 @@ S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
I32 first, U32 depth)
        if (latest == NULL) {
            if (flags & TRYAGAIN)
                continue;
-           return(NULL);
+            if (flags & RESTART_UTF8) {
+                *flagp = RESTART_UTF8;
+                return NULL;
+            }
+            FAIL2("panic: regpiece returned NULL, flags=%#X", flags);
        }
        else if (ret == NULL)
            ret = latest;
@@ -9406,6 +9444,11 @@ S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
I32 first, U32 depth)
  * both the endmarker for their branch list and the body of the last branch.
  * It might seem that this node could be dispensed with entirely, but the
  * endmarker role is not redundant.
+ *
+ * Returns NULL, setting *flagp to TRYAGAIN if regatom() returns NULL with
+ * TRYAGAIN.
+ * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be
+ * restarted.
  */
 STATIC regnode *
 S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
@@ -9434,8 +9477,10 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
 
     ret = regatom(pRExC_state, &flags,depth+1);
     if (ret == NULL) {
-       if (flags & TRYAGAIN)
-           *flagp |= TRYAGAIN;
+       if (flags & (TRYAGAIN|RESTART_UTF8))
+           *flagp |= flags & (TRYAGAIN|RESTART_UTF8);
+        else
+            FAIL2("panic: regatom returned NULL, flags=%#X", flags);
        return(NULL);
     }
 
@@ -9665,7 +9710,10 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state, 
regnode** node_p, UV *valuep, I
 
    The function raises an error (via vFAIL), and doesn't return for various
    syntax errors.  Otherwise it returns TRUE and sets <node_p> or <valuep> on
-   success; it returns FALSE otherwise.
+   success; it returns FALSE otherwise. Returns FALSE, setting *flagp to
+   RESTART_UTF8 if the sizing scan needs to be restarted. Such a restart is
+   only possible if node_p is non-NULL.
+
 
    If <valuep> is non-null, it means the caller can accept an input sequence
    consisting of a just a single code point; <*valuep> is set to that value
@@ -9869,7 +9917,14 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state, 
regnode** node_p, UV *valuep, I
        /* The values are Unicode, and therefore not subject to recoding */
        RExC_override_recoding = 1;
 
-       *node_p = reg(pRExC_state, 1, &flags, depth+1);
+       if (!(*node_p = reg(pRExC_state, 1, &flags, depth+1))) {
+            if (flags & RESTART_UTF8) {
+                *flagp = RESTART_UTF8;
+                return FALSE;
+            }
+            FAIL2("panic: reg returned NULL to grok_bslash_N, flags=%#X",
+                  flags);
+        } 
        *flagp |= flags&(HASWIDTH|SPSTART|SIMPLE|POSTPONED);
 
        RExC_parse = endbrace;
@@ -10069,6 +10124,12 @@ S_alloc_maybe_populate_EXACT(pTHX_ RExC_state_t 
*pRExC_state, regnode *node, I32
    escape sequences, with the one for handling literal escapes requiring
    a dummy entry for all of the special escapes that are actually handled
    by the other.
+
+   Returns NULL, setting *flagp to TRYAGAIN if reg() returns NULL with
+   TRYAGAIN.  
+   Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be
+   restarted.
+   Otherwise does not return NULL.
 */
 
 STATIC regnode *
@@ -10136,6 +10197,12 @@ tryagain:
            RExC_parse = oregcomp_parse;
            vFAIL("Unmatched [");
        }
+        if (ret == NULL) {
+            if (*flagp & RESTART_UTF8)
+                return NULL;
+            FAIL2("panic: regclass returned NULL to regatom, flags=%#X",
+                  *flagp);
+        }
        nextchar(pRExC_state);
         Set_Node_Length(ret, RExC_parse - oregcomp_parse + 1); /* MJD */
        break;
@@ -10152,7 +10219,11 @@ tryagain:
                    }
                    goto tryagain;
                }
-               return(NULL);
+                if (flags & RESTART_UTF8) {
+                    *flagp = RESTART_UTF8;
+                    return NULL;
+                }
+                FAIL2("panic: reg returned NULL to regatom, flags=%#X", flags);
        }
        *flagp |= flags&(HASWIDTH|SPSTART|SIMPLE|POSTPONED);
        break;
@@ -10344,6 +10415,11 @@ tryagain:
                                          It would be a bug if these returned
                                          non-portables */
                                NULL);
+                /* regclass() can only return RESTART_UTF8 if multi-char folds
+                   are allowed.  */
+                if (!ret)
+                    FAIL2("panic: regclass returned NULL to regatom, 
flags=%#X",
+                          *flagp);
 
                RExC_parse--;
 
@@ -10366,6 +10442,8 @@ tryagain:
             ++RExC_parse;
             if (! grok_bslash_N(pRExC_state, &ret, NULL, flagp, depth, FALSE,
                                 FALSE /* not strict */ )) {
+                if (*flagp & RESTART_UTF8)
+                    return NULL;
                 RExC_parse--;
                 goto defchar;
             }
@@ -10632,6 +10710,8 @@ tryagain:
                                             flagp, depth, FALSE,
                                             FALSE /* not strict */ ))
                         {
+                            if (*flagp & RESTART_UTF8)
+                                FAIL("panic: grok_bslash_N set RESTART_UTF8");
                             RExC_parse = p = oldp;
                             goto loopdone;
                         }
@@ -11180,8 +11260,7 @@ S_regpatws( RExC_state_t *pRExC_state, char *p , const 
bool recognize_comment )
 #define POSIXCC(c) (POSIXCC_DONE(c) || POSIXCC_NOTYET(c))
 
 PERL_STATIC_INLINE I32
-S_regpposixcc(pTHX_ RExC_state_t *pRExC_state, I32 value, SV *free_me,
-                    const bool strict)
+S_regpposixcc(pTHX_ RExC_state_t *pRExC_state, I32 value, const bool strict)
 {
     dVAR;
     I32 namedclass = OOB_NAMEDCLASS;
@@ -11305,7 +11384,6 @@ S_regpposixcc(pTHX_ RExC_state_t *pRExC_state, I32 
value, SV *free_me,
                       the class closes */
                    while (UCHARAT(RExC_parse) && UCHARAT(RExC_parse) != ']')
                        RExC_parse++;
-                   SvREFCNT_dec(free_me);
                    vFAIL3("POSIX syntax [%c %c] is reserved for future 
extensions", c, c);
                }
            } else {
@@ -11441,13 +11519,18 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, U32 depth,
                         RExC_parse++;
                     }
 
-                    (void) regclass(pRExC_state, flagp,depth+1,
-                                    is_posix_class, /* parse the whole char
-                                                       class only if not a
-                                                       posix class */
-                                    FALSE, /* don't allow multi-char folds */
-                                    TRUE, /* silence non-portable warnings. */
-                                    &current);
+                    /* regclass() can only return RESTART_UTF8 if multi-char
+                       folds are allowed.  */
+                    if (!regclass(pRExC_state, flagp,depth+1,
+                                  is_posix_class, /* parse the whole char
+                                                     class only if not a
+                                                     posix class */
+                                  FALSE, /* don't allow multi-char folds */
+                                  TRUE, /* silence non-portable warnings. */
+                                  &current))
+                        FAIL2("panic: regclass returned NULL to handle_sets, 
flags=%#X",
+                              *flagp);
+
                     /* function call leaves parse pointing to the ']', except
                      * if we faked it */
                     if (is_posix_class) {
@@ -11537,12 +11620,15 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, U32 depth,
                 vFAIL("Unexpected character");
 
             case '\\':
-                (void) regclass(pRExC_state, flagp,depth+1,
-                                TRUE, /* means parse just the next thing */
-                                FALSE, /* don't allow multi-char folds */
-                                FALSE, /* don't silence non-portable warnings.
-                                        */
-                                &current);
+                /* regclass() can only return RESTART_UTF8 if multi-char
+                   folds are allowed.  */
+                if (!regclass(pRExC_state, flagp,depth+1,
+                              TRUE, /* means parse just the next thing */
+                              FALSE, /* don't allow multi-char folds */
+                              FALSE, /* don't silence non-portable warnings.  
*/
+                              &current))
+                    FAIL2("panic: regclass returned NULL to handle_sets, 
flags=%#X",
+                          *flagp);
                 /* regclass() will return with parsing just the \ sequence,
                  * leaving the parse pointer at the next thing to parse */
                 RExC_parse--;
@@ -11556,13 +11642,16 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, U32 depth,
                     RExC_parse++;
                 }
 
-                (void) regclass(pRExC_state, flagp,depth+1,
-                                is_posix_class, /* parse the whole char class
-                                                   only if not a posix class */
-                                FALSE, /* don't allow multi-char folds */
-                                FALSE, /* don't silence non-portable warnings.
-                                        */
-                                &current);
+                /* regclass() can only return RESTART_UTF8 if multi-char
+                   folds are allowed.  */
+                if(!regclass(pRExC_state, flagp,depth+1,
+                             is_posix_class, /* parse the whole char class
+                                                only if not a posix class */
+                             FALSE, /* don't allow multi-char folds */
+                             FALSE, /* don't silence non-portable warnings.  */
+                             &current))
+                    FAIL2("panic: regclass returned NULL to handle_sets, 
flags=%#X",
+                          *flagp);
                 /* function call leaves parse pointing to the ']', except if we
                  * faked it */
                 if (is_posix_class) {
@@ -11737,6 +11826,8 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, U32 depth,
      * already has all folding taken into consideration, and we don't want
      * regclass() to add to that */
     RExC_flags &= ~RXf_PMf_FOLD;
+    /* regclass() can only return RESTART_UTF8 if multi-char folds are allowed.
+     */
     node = regclass(pRExC_state, flagp,depth+1,
                     FALSE, /* means parse the whole char class */
                     FALSE, /* don't allow multi-char folds */
@@ -11744,6 +11835,8 @@ S_handle_sets(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, U32 depth,
                              well have generated non-portable code points, but
                              they're valid on this machine */
                     NULL);
+    if (!node)
+        FAIL2("panic: regclass returned NULL to handle_sets, flags=%#X", 
flagp);
     if (save_fold) {
         RExC_flags |= RXf_PMf_FOLD;
     }
@@ -11793,7 +11886,11 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, U32 depth,
      * corresponding bit set if that character is in the list.  For characters
      * above 255, a range list or swash is used.  There are extra bits for \w,
      * etc. in locale ANYOFs, as what these match is not determinable at
-     * compile time */
+     * compile time
+     *
+     * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs
+     * to be restarted.  This can only happen if ret_invlist is non-NULL.
+     */
 
     dVAR;
     UV prevvalue = OOB_UNICODE, save_prevvalue = OOB_UNICODE;
@@ -11879,8 +11976,9 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth,
        if (LOC) {
            ANYOF_FLAGS(ret) |= ANYOF_LOCALE;
        }
-       listsv = newSVpvs("# comment\n");
+       listsv = newSVpvs_flags("# comment\n", SVs_TEMP);
        initial_listsv_len = SvCUR(listsv);
+        SvTEMP_off(listsv); /* Grr, TEMPs and mortals are conflated.  */
     }
 
     if (skip_white) {
@@ -11908,12 +12006,10 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, U32 depth,
            s++;
        if (*s && c == *s && s[1] == ']') {
            SAVEFREESV(RExC_rx_sv);
-           SAVEFREESV(listsv);
            ckWARN3reg(s+2,
                       "POSIX syntax [%c %c] belongs inside character classes",
                       c, c);
            (void)ReREFCNT_inc(RExC_rx_sv);
-           SvREFCNT_inc_simple_void_NN(listsv);
        }
     }
 
@@ -11965,7 +12061,7 @@ parseit:
             && RExC_parse < RExC_end
             && POSIXCC(UCHARAT(RExC_parse)))
         {
-            namedclass = regpposixcc(pRExC_state, value, listsv, strict);
+            namedclass = regpposixcc(pRExC_state, value, strict);
         }
         else if (value == '\\') {
            if (UTF) {
@@ -12010,6 +12106,8 @@ parseit:
                                       TRUE, /* => charclass */
                                       strict))
                     {
+                        if (*flagp & RESTART_UTF8)
+                            FAIL("panic: grok_bslash_N set RESTART_UTF8");
                         goto parseit;
                     }
                 }
@@ -12203,7 +12301,6 @@ parseit:
                     value = grok_oct(--RExC_parse, &numlen, &flags, NULL);
                    RExC_parse += numlen;
                     if (numlen != 3) {
-                        SAVEFREESV(listsv); /* In case warnings are fatalized 
*/
                         if (strict) {
                             RExC_parse += (UTF) ? UTF8SKIP(RExC_parse) : 1;
                             vFAIL("Need exactly 3 octal digits");
@@ -12220,7 +12317,6 @@ parseit:
                                  form_short_octal_warning(RExC_parse, numlen));
                             (void)ReREFCNT_inc(RExC_rx_sv);
                         }
-                        SvREFCNT_inc_simple_void_NN(listsv);
                     }
                    if (PL_encoding && value < 0x100)
                        goto recode_encoding;
@@ -12244,7 +12340,6 @@ parseit:
            default:
                /* Allow \_ to not give an error */
                if (!SIZE_ONLY && isWORDCHAR(value) && value != '_') {
-                   SAVEFREESV(listsv);
                     if (strict) {
                         vFAIL2("Unrecognized escape \\%c in character class",
                                (int)value);
@@ -12256,7 +12351,6 @@ parseit:
                             (int)value);
                         (void)ReREFCNT_inc(RExC_rx_sv);
                     }
-                   SvREFCNT_inc_simple_void_NN(listsv);
                }
                break;
            }   /* End of switch on char following backslash */
@@ -12302,7 +12396,6 @@ parseit:
                    const int w = (RExC_parse >= rangebegin)
                                   ? RExC_parse - rangebegin
                                   : 0;
-                   SAVEFREESV(listsv); /* in case of fatal warnings */
                     if (strict) {
                         vFAIL4("False [] range \"%*.*s\"", w, w, rangebegin);
                     }
@@ -12315,7 +12408,6 @@ parseit:
                         cp_list = add_cp_to_invlist(cp_list, '-');
                         cp_list = add_cp_to_invlist(cp_list, prevvalue);
                     }
-                    SvREFCNT_inc_simple_void_NN(listsv);
                }
 
                range = 0; /* this was not a true range */
@@ -12851,13 +12943,12 @@ parseit:
 
        ret = reg(pRExC_state, 1, &reg_flags, depth+1);
 
-       *flagp |= reg_flags&(HASWIDTH|SIMPLE|SPSTART|POSTPONED);
+       *flagp |= reg_flags&(HASWIDTH|SIMPLE|SPSTART|POSTPONED|RESTART_UTF8);
 
        RExC_parse = save_parse;
        RExC_end = save_end;
        RExC_in_multi_char_class = 0;
         SvREFCNT_dec_NN(multi_char_matches);
-        SvREFCNT_dec_NN(listsv);
         return ret;
     }
 
@@ -13015,7 +13106,6 @@ parseit:
             RExC_parse = (char *) cur_parse;
 
             SvREFCNT_dec(posixes);
-            SvREFCNT_dec_NN(listsv);
             SvREFCNT_dec(cp_list);
             return ret;
         }
@@ -13383,7 +13473,7 @@ parseit:
         else {
             RExC_emit = orig_emit;
         }
-        return END;
+        return orig_emit;
     }
 
     /* If we didn't do folding, it's because some information isn't available
@@ -13517,7 +13607,6 @@ parseit:
             }
 
             SvREFCNT_dec_NN(cp_list);
-            SvREFCNT_dec_NN(listsv);
             return ret;
         }
     }
@@ -13605,7 +13694,6 @@ parseit:
        && ! HAS_NONLOCALE_RUNTIME_PROPERTY_DEFINITION)
     {
        ARG_SET(ret, ANYOF_NONBITMAP_EMPTY);
-       SvREFCNT_dec_NN(listsv);
     }
     else {
        /* av[0] stores the character class description in its textual form:
@@ -13622,8 +13710,7 @@ parseit:
        SV *rv;
 
        av_store(av, 0, (HAS_NONLOCALE_RUNTIME_PROPERTY_DEFINITION)
-                       ? listsv
-                       : (SvREFCNT_dec_NN(listsv), &PL_sv_undef));
+                       ? SvREFCNT_inc(listsv) : &PL_sv_undef);
        if (swash) {
            av_store(av, 1, swash);
            SvREFCNT_dec_NN(cp_list);
diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index 29a64dd..cd067f4 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -1082,6 +1082,14 @@ sub run_tests {
         eval "q(W) =~ /\\N{$name}/";
         ok ! $w, 'Verify that latin1 letter in name doesnt give warning';
 
+        # This tests the code path that restarts the parse when the recursive
+        # call to S_reg() from within S_grok_bslash_N() discovers that the
+        # pattern needs to be recalculated as UTF-8.  use eval to avoid
+        # needing literal Unicode in this source file:
+        my $r = eval "qr/\\N{\x{100}\x{100}}/";
+        isnt $r, undef, "Generated regex for multi-char UTF-8 charname"
+           or diag($@);
+        ok "\x{100}\x{100}" =~ $r, "which matches";
     }
 
     {
diff --git a/t/re/re_tests b/t/re/re_tests
index c41d529..7e7fc85 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -592,6 +592,10 @@ x(~~)*(?:(?:F)?)?  x~~     y       -       -
 (?(?!a)b|a)    a       y       $&      a
 (?(?=a)b|a)    a       n       -       -
 (?(?=a)a|b)    a       y       $&      a
+(?(?!\x{100})\x{100}|b)        \x{100} n       -       -
+(?(?!\x{100})b|\x{100})        \x{100} y       $&      \x{100}
+(?(?=\x{100})b|\x{100})        \x{100} n       -       -
+(?(?=\x{100})\x{100}|b)        \x{100} y       $&      \x{100}
 (?=(a+?))(\1ab)        aaab    y       $2      aab
 ^(?=(a+?))\1ab aaab    n       -       -
 (\w+:)+        one:    y       $1      one:
diff --git a/utf8.c b/utf8.c
index 1bf3f52..ec3a0ba 100644
--- a/utf8.c
+++ b/utf8.c
@@ -4423,9 +4423,12 @@ The pointer to the PV of the C<dsv> is returned.
 char *
 Perl_sv_uni_display(pTHX_ SV *dsv, SV *ssv, STRLEN pvlim, UV flags)
 {
+    const char * const ptr =
+        isREGEXP(ssv) ? RX_WRAPPED((REGEXP*)ssv) : SvPVX_const(ssv);
+
     PERL_ARGS_ASSERT_SV_UNI_DISPLAY;
 
-     return Perl_pv_uni_display(aTHX_ dsv, (const U8*)SvPVX_const(ssv),
+    return Perl_pv_uni_display(aTHX_ dsv, (const U8*)ptr,
                                SvCUR(ssv), pvlim, flags);
 }
 

--
Perl5 Master Repository

Reply via email to