In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/512c0f5a351dd399dbb069e222741582044f88cc?hp=6b5cf123f371e012d9812b37b13d50c6e06bf555>

- Log -----------------------------------------------------------------
commit 512c0f5a351dd399dbb069e222741582044f88cc
Author: Karl Williamson <[email protected]>
Date:   Fri Sep 11 10:08:50 2015 -0600

    PATCH: [perl #125990] panic: reg_node overrun
    
    This is a result of a design flaw that I introduced in earlier releases
    when attempting to fix earlier design flaws in dealing with the outlier
    character ß, LATIN SMALL LETTER SHARP S.  The uppercase of this letter
    is SS, so that when comparing case-insensitively, it should match 'ss',
    and hence, in Unicode terminology, it folds to 'ss'.  This character is
    the only one representable without using UTF-8 whose fold is longer than
    1 byte, and so has to have special treatment.  Similarly, the sequence
    'ss' can match caselessly the single byte ß, and this is the only such
    sequence that can match something shorter than it, unless UTF-8 is
    involved.  The matter is complicated by the fact that under /di rules,
    the ß and 'ss' don't match each other, unless the target string is in
    UTF-8.  The solution I used earlier (and continue to use) was to create
    a special regnode EXACTFU_SS under /ui rules, in which any ß is folded
    to 'ss'.  But under /di rules, a regular EXACTF regnode is used, and any
    ß is retained as-is.
    
    The problem reported here arises when something during the sizing pass
    tells perl to use /ui rules rather than the /di rules that were in
    effect at the beginning.  Recall that perl uses /d rules, for backward
    compatibility, unless something overrides them.  This can be a 'use'
    declaration, an explicit character set pattern modifier, or something in
    the pattern.  This bug happens only with the final case.  There are
    several Unicode-defined constructs that can occur in patterns; if one is
    found, the perl interpreter infers that Unicode is desired, and switches
    from /d to /u for the whole pattern.  Two such constructs are a Unicode
    property, \p{}, and a Unicode named character, \N{}.  The
    problem-reproducing code for this ticket uses the latter.
    
    The problem was that the switch from /di to /ui was deferred until AFTER
    the sizing pass.  (A flag was set when one of these constructs was
    encountered to tell the parser to later do the switch.)
    
    During the second pass, the code realizes it is under /ui, so creates an
    EXACTFU_SS node and folds the ß into 'ss'.  But the first pass thought
    it was under /di, so it sized for just the ß, i.e., for 1 byte, so we
    exceed the allocated space and do a wild write.  This may not cause a
    problem if the malloc'd space had rounded-up and there were only a few
    of these ß characters.
    
    One solution I considered was just keeping a global count of the ß
    characters in EXACTF nodes.  One could just add these to the space
    reserved if /ui rules ended up being used.  The problem with this is that
    nodes that are near their maximum size without the extra space could
    exceed it with, and thus have to be split into 2 nodes, and the extra
    node would have an unplanned-for header, taking up more unaccounted-for
    space.  So that doesn't work.  One could also just reserve two bytes for
    every ß in an EXACTF node, thus wasting space unless /ui ends up being
    used.  But the bigger problem is that the code that splits nodes would
    have to be made more complicated.  It has to find a suitable splitting
    spot, by searching through the text of the node, and now it would have
    to deal with some of that space not being set.
    
    Instead, I opted to change the code so that when it finds one of these
    Unicode-defined constructs, it switches to /u immediately during the
    sizing pass.  That means that the parse afterwards knows that it is /u
    and allocates the correct space.  (We now have to remain in /u for the
    remainder of the pass, so some code had to change that reverted this.)
    This fixes the test case in the ticket.  But there remains a problem if
    the sizing has happened earlier in the parse before the construct that
    changes from /d to /u is encountered.  Like:
    
        qr/.....ß....\N{}/di
    
    The incorrect sizing has already happened by the time the \N{} is
    encountered.  One could solve this by restarting the parse whenever the
    /d goes to /u (under /i, as this issue isn't a problem except when
    folding ß).  That slows things down.  Instead, I opted to set a global
    flag whenever a ß is found in an EXACTF node.  If that flag isn't set at
    the time of the /d to /u switch, there's no need to restart the parse.
    
    A 'use utf8' or 'use 5.012' or higher selects /u over /d, so the problem
    did not happen with them, nor if the pattern has to be converted to
    UTF-8, which restarts the sizing pass, and it only happens with the
    sharp s character.  And probably unless there a several ß characters,
    the rounding-up of malloc space, would cause this to not be an issue.
    These explain why this hasn't been reported from the field.

M       regcomp.c
M       t/re/pat_advanced.t

commit b97943f7f63f0d1fe0d9501503c553accdbdbb81
Author: Karl Williamson <[email protected]>
Date:   Wed Sep 9 19:40:09 2015 -0600

    regcomp.c: Split an internal flag into 2
    
    This splits the flag used to communicate between parsing layers that the
    sizing pass needs to be restarted and the pattern upgraded to UTF-8.  It
    is split into a bit meaning to restart pass1 and a bit to do the
    upgrade.  This is in preparation for the next commit which will have a
    2nd reason to restart pass1.

M       regcomp.c

commit 539601d8d15042c7077d7fecb1acd97058ddd274
Author: Karl Williamson <[email protected]>
Date:   Fri Sep 11 10:09:06 2015 -0600

    regcomp.c: Reorder a test
    
    Prior to this commit, the code tested for some side effects before
    testing if the called function even succeeded.  This hasn't been a
    problem before, because the called function didn't fail when called
    from this context.  But a future commit will change that.

M       regcomp.c

commit 82a6ada44d6e80d9ab64fe6a0334f04487506fdb
Author: Karl Williamson <[email protected]>
Date:   Wed Sep 9 19:37:02 2015 -0600

    regcomp.c: Add assertion and parameter to macro
    
    It's clearer and safer to pass the name of a local variable to a macro,
    rather than assuming the macro knows the correct name.

M       regcomp.c

commit 0f0eec8fa0e9c6558ae31d368ff8b2c87314deee
Author: Karl Williamson <[email protected]>
Date:   Wed Sep 9 12:43:15 2015 -0600

    regcomp.c: Fix, clarify comments

M       regcomp.c
-----------------------------------------------------------------------

Summary of changes:
 regcomp.c           | 236 +++++++++++++++++++++++++++++++++-------------------
 t/re/pat_advanced.t |  22 +++++
 2 files changed, 171 insertions(+), 87 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index 24af9d7..47b451c 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -214,6 +214,7 @@ struct RExC_state_t {
 #define RExC_mysv2     (pRExC_state->mysv2)
 
 #endif
+    bool        seen_unfolded_sharp_s;
 };
 
 #define RExC_flags     (pRExC_state->flags)
@@ -226,6 +227,17 @@ struct RExC_state_t {
 #define RExC_end       (pRExC_state->end)
 #define RExC_parse     (pRExC_state->parse)
 #define RExC_whilem_seen       (pRExC_state->whilem_seen)
+
+/* Set during the sizing pass when there is a LATIN SMALL LETTER SHARP S in any
+ * EXACTF node, hence was parsed under /di rules.  If later in the parse,
+ * something forces the pattern into using /ui rules, the sharp s should be
+ * folded into the sequence 'ss', which takes up more space than previously
+ * calculated.  This means that the sizing pass needs to be restarted.  (The
+ * node also becomes an EXACTFU_SS.)  For all other characters, an EXACTF node
+ * that gets converted to /ui (and EXACTFU) occupies the same amount of space,
+ * so there is no need to resize [perl #125990]. */
+#define RExC_seen_unfolded_sharp_s (pRExC_state->seen_unfolded_sharp_s)
+
 #ifdef RE_TRACK_PATTERN_OFFSETS
 #define RExC_offsets   (pRExC_state->rxi->u.offsets) /* I am not like the
                                                          others */
@@ -299,7 +311,9 @@ struct RExC_state_t {
 #define        SPSTART         0x04    /* Starts with * or + */
 #define POSTPONED      0x08    /* (?1),(?&name), (??{...}) or similar */
 #define TRYAGAIN       0x10    /* Weeded out a declaration. */
-#define RESTART_UTF8    0x20    /* Restart, need to calcuate sizes as UTF-8 */
+#define RESTART_PASS1   0x20    /* Need to restart sizing pass */
+#define NEED_UTF8       0x40    /* In conjunction with RESTART_PASS1, need to
+                                   calcuate sizes as UTF-8 */
 
 #define REG_NODE_NUM(x) ((x) ? (int)((x)-RExC_emit_start) : -1)
 
@@ -318,12 +332,30 @@ struct RExC_state_t {
 #define PAREN_SET(u8str,paren) PBYTE(u8str,paren) |= PBITVAL(paren)
 #define PAREN_UNSET(u8str,paren) PBYTE(u8str,paren) &= (~PBITVAL(paren))
 
-#define REQUIRE_UTF8   STMT_START {                                       \
+#define REQUIRE_UTF8(flagp) STMT_START {                                   \
                                      if (!UTF) {                           \
-                                         *flagp = RESTART_UTF8;            \
+                                         assert(PASS1);                    \
+                                         *flagp = RESTART_PASS1|NEED_UTF8; \
                                          return NULL;                      \
                                      }                                     \
-                        } STMT_END
+                             } STMT_END
+
+/* Change from /d into /u rules, and restart the parse if we've already seen
+ * something whose size would increase as a result, by setting *flagp and
+ * returning 'restart_retval'.  RExC_uni_semantics is a flag that indicates
+ * we've change to /u during the parse.  */
+#define REQUIRE_UNI_RULES(flagp, restart_retval)                            \
+    STMT_START {                                                            \
+            if (DEPENDS_SEMANTICS) {                                        \
+                assert(PASS1);                                              \
+                set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET);      \
+                RExC_uni_semantics = 1;                                     \
+                if (RExC_seen_unfolded_sharp_s) {                           \
+                    *flagp |= RESTART_PASS1;                                \
+                    return restart_retval;                                  \
+                }                                                           \
+            }                                                               \
+    } STMT_END
 
 /* This converts the named class defined in regcomp.h to its equivalent class
  * number defined in handy.h. */
@@ -6567,7 +6599,9 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int 
pat_count,
 
     /* ignore the utf8ness if the pattern is 0 length */
     RExC_utf8 = RExC_orig_utf8 = (plen == 0 || IN_BYTES) ? 0 : SvUTF8(pat);
+
     RExC_uni_semantics = 0;
+    RExC_seen_unfolded_sharp_s = 0;
     RExC_contains_locale = 0;
     RExC_contains_i = 0;
     RExC_strict = cBOOL(pm_flags & RXf_PMf_STRICT);
@@ -6588,8 +6622,8 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int 
pat_count,
         });
 
   redo_first_pass:
-    /* we jump here if we upgrade the pattern to utf8 and have to
-     * recompile */
+    /* we jump here if we have to recompile, e.g., from upgrading the pattern
+     * to utf8 */
 
     if ((pm_flags & PMf_USE_RE_EVAL)
                /* this second condition covers the non-regex literal case,
@@ -6623,7 +6657,9 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int 
pat_count,
     if (rx_flags & PMf_FOLD) {
         RExC_contains_i = 1;
     }
-    if (RExC_utf8 && initial_charset == REGEX_DEPENDS_CHARSET) {
+    if (   initial_charset == REGEX_DEPENDS_CHARSET
+        && (RExC_utf8 ||RExC_uni_semantics))
+    {
 
        /* Set to use unicode semantics if the pattern is in utf8 and has the
         * 'depends' charset specified, as it means unicode when utf8  */
@@ -6709,9 +6745,16 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int 
pat_count,
         at least some part of the pattern, and therefore must convert the whole
         thing.
         -- dmq */
-        if (flags & RESTART_UTF8) {
-            S_pat_upgrade_to_utf8(aTHX_ pRExC_state, &exp, &plen,
+        if (flags & RESTART_PASS1) {
+            if (flags & NEED_UTF8) {
+                S_pat_upgrade_to_utf8(aTHX_ pRExC_state, &exp, &plen,
                                     pRExC_state->num_code_blocks);
+            }
+            else {
+                DEBUG_PARSE_r(PerlIO_printf(Perl_debug_log,
+                "Need to redo pass 1\n"));
+            }
+
             goto redo_first_pass;
         }
         Perl_croak(aTHX_ "panic: reg returned NULL to re_op_compile for sizing 
pass, flags=%#"UVxf"", (UV) flags);
@@ -9768,10 +9811,10 @@ S_parse_lparen_question_flags(pTHX_ RExC_state_t 
*pRExC_state)
 #endif
 
 /* Returns NULL, setting *flagp to TRYAGAIN at the end of (?) that only sets
-   flags. Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan
-   needs to be restarted.
-   Otherwise would only return NULL if regbranch() returns NULL, which
-   cannot happen.  */
+   flags. Returns NULL, setting *flagp to RESTART_PASS1 if the sizing scan
+   needs to be restarted, or'd with NEED_UTF8 if the pattern needs to be
+   upgraded to UTF-8.  Otherwise would only return NULL if regbranch() returns
+   NULL, which cannot happen.  */
 STATIC regnode *
 S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
     /* paren: Parenthesized? 0=top; 1,2=inside '(': changed to letter.
@@ -10299,8 +10342,8 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
                            ret->flags = 1;
 
                         tail = reg(pRExC_state, 1, &flag, depth+1);
-                        if (flag & RESTART_UTF8) {
-                            *flagp = RESTART_UTF8;
+                        if (flag & (RESTART_PASS1|NEED_UTF8)) {
+                            *flagp = flag & (RESTART_PASS1|NEED_UTF8);
                             return NULL;
                         }
                         REGTAIL(pRExC_state, ret, tail);
@@ -10388,8 +10431,8 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
                     REGTAIL(pRExC_state, ret, reganode(pRExC_state, IFTHEN, 
0));
                     br = regbranch(pRExC_state, &flags, 1,depth+1);
                    if (br == NULL) {
-                        if (flags & RESTART_UTF8) {
-                            *flagp = RESTART_UTF8;
+                        if (flags & (RESTART_PASS1|NEED_UTF8)) {
+                            *flagp = flags & (RESTART_PASS1|NEED_UTF8);
                             return NULL;
                         }
                         FAIL2("panic: regbranch returned NULL, flags=%#"UVxf"",
@@ -10408,8 +10451,8 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
                         lastbr = reganode(pRExC_state, IFTHEN, 0);
 
                         if (!regbranch(pRExC_state, &flags, 1,depth+1)) {
-                            if (flags & RESTART_UTF8) {
-                                *flagp = RESTART_UTF8;
+                            if (flags & (RESTART_PASS1|NEED_UTF8)) {
+                                *flagp = flags & (RESTART_PASS1|NEED_UTF8);
                                 return NULL;
                             }
                             FAIL2("panic: regbranch returned NULL, 
flags=%#"UVxf"",
@@ -10506,8 +10549,8 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
     /*     branch_len = (paren != 0); */
 
     if (br == NULL) {
-        if (flags & RESTART_UTF8) {
-            *flagp = RESTART_UTF8;
+        if (flags & (RESTART_PASS1|NEED_UTF8)) {
+            *flagp = flags & (RESTART_PASS1|NEED_UTF8);
             return NULL;
         }
         FAIL2("panic: regbranch returned NULL, flags=%#"UVxf"", (UV) flags);
@@ -10553,8 +10596,8 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
         br = regbranch(pRExC_state, &flags, 0, depth+1);
 
        if (br == NULL) {
-            if (flags & RESTART_UTF8) {
-                *flagp = RESTART_UTF8;
+            if (flags & (RESTART_PASS1|NEED_UTF8)) {
+                *flagp = flags & (RESTART_PASS1|NEED_UTF8);
                 return NULL;
             }
             FAIL2("panic: regbranch returned NULL, flags=%#"UVxf"", (UV) 
flags);
@@ -10685,8 +10728,12 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
 
     /* Check for proper termination. */
     if (paren) {
-        /* restore original flags, but keep (?p) */
+        /* restore original flags, but keep (?p) and, if we've changed from /d
+         * rules to /u, keep the /u */
        RExC_flags = oregflags | (RExC_flags & RXf_PMf_KEEPCOPY);
+        if (DEPENDS_SEMANTICS && RExC_uni_semantics) {
+            set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET);
+        }
        if (RExC_parse >= RExC_end || *nextchar(pRExC_state) != ')') {
            RExC_parse = oregcomp_parse;
            vFAIL("Unmatched (");
@@ -10715,8 +10762,8 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 
*flagp,U32 depth)
  *
  * Implements the concatenation operator.
  *
- * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be
- * restarted.
+ * Returns NULL, setting *flagp to RESTART_PASS1 if the sizing scan needs to be
+ * restarted, or'd with NEED_UTF8 if the pattern needs to be upgraded to UTF-8
  */
 STATIC regnode *
 S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth)
@@ -10755,8 +10802,8 @@ S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, I32 first, U32 depth)
        if (latest == NULL) {
            if (flags & TRYAGAIN)
                continue;
-            if (flags & RESTART_UTF8) {
-                *flagp = RESTART_UTF8;
+            if (flags & (RESTART_PASS1|NEED_UTF8)) {
+                *flagp = flags & (RESTART_PASS1|NEED_UTF8);
                 return NULL;
             }
             FAIL2("panic: regpiece returned NULL, flags=%#"UVxf"", (UV) flags);
@@ -10798,8 +10845,8 @@ S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, I32 first, U32 depth)
  *
  * Returns NULL, setting *flagp to TRYAGAIN if regatom() returns NULL with
  * TRYAGAIN.
- * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be
- * restarted.
+ * Returns NULL, setting *flagp to RESTART_PASS1 if the sizing scan needs to be
+ * restarted, or'd with NEED_UTF8 if the pattern needs to be upgraded to UTF-8
  */
 STATIC regnode *
 S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
@@ -10828,8 +10875,8 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
 
     ret = regatom(pRExC_state, &flags,depth+1);
     if (ret == NULL) {
-       if (flags & (TRYAGAIN|RESTART_UTF8))
-           *flagp |= flags & (TRYAGAIN|RESTART_UTF8);
+       if (flags & (TRYAGAIN|RESTART_PASS1|NEED_UTF8))
+           *flagp |= flags & (TRYAGAIN|RESTART_PASS1|NEED_UTF8);
         else
             FAIL2("panic: regatom returned NULL, flags=%#"UVxf"", (UV) flags);
        return(NULL);
@@ -11106,14 +11153,17 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state,
   * sequence. *node_p * will be set to a generated node returned by this
   * function calling S_reg().
   *
-  * The final possibility, which happens only when the fourth one would
-  * otherwise be in effect, is that one of those code points requires the
+  * The final possibility, which happens is that it is premature to be calling
+  * this function; that pass1 needs to be restarted.  This can happen when this
+  * changes from /d to /u rules, or when the pattern needs to be upgraded to
+  * UTF-8.  The latter occurs only when the fourth possibility would otherwise
+  * be in effect, and is because one of those code points requires the
   * pattern to be recompiled as UTF-8.  The function returns FALSE, and sets
-  * the RESTART_UTF8 flag in *flagp.  When this happens, the caller needs to
-  * desist from continuing parsing, and return this information to its caller.
-  * This is not set for when there is only one code point, as this can be
-  * called as part of an ANYOF node, and they can store above-Latin1 code
-  * points without the pattern having to be in UTF-8.
+  * the RESTART_PASS1 and NEED_UTF8 flags in *flagp, as appropriate.  When this
+  * happens, the caller needs to desist from continuing parsing, and return
+  * this information to its caller.  This is not set for when there is only one
+  * code point, as this can be called as part of an ANYOF node, and they can
+  * store above-Latin1 code points without the pattern having to be in UTF-8.
   *
   * For non-single-quoted regexes, the tokenizer has resolved character and
   * sequence names inside \N{...} into their Unicode values, normalizing the
@@ -11202,7 +11252,8 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state,
        vFAIL("\\N{NAME} must be resolved by the lexer");
     }
 
-    RExC_uni_semantics = 1; /* Unicode named chars imply Unicode semantics */
+    REQUIRE_UNI_RULES(flagp, FALSE); /* Unicode named chars imply Unicode
+                                        semantics */
 
     if (endbrace == RExC_parse) {   /* empty: \N{} */
         if (cp_count) {
@@ -11342,8 +11393,8 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state,
 
         if (node_p) {
             if (!(*node_p = reg(pRExC_state, 1, &flags, depth+1))) {
-                if (flags & RESTART_UTF8) {
-                    *flagp = RESTART_UTF8;
+                if (flags & (RESTART_PASS1|NEED_UTF8)) {
+                    *flagp = flags & (RESTART_PASS1|NEED_UTF8);
                     return FALSE;
                 }
                 FAIL2("panic: reg returned NULL to grok_bslash_N, 
flags=%#"UVxf"",
@@ -11672,8 +11723,8 @@ S_backref_value(char *p)
 
    Returns NULL, setting *flagp to TRYAGAIN if reg() returns NULL with
    TRYAGAIN.
-   Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs to be
-   restarted.
+   Returns NULL, setting *flagp to RESTART_PASS1 if the sizing scan needs to be
+   restarted, or'd with NEED_UTF8 if the pattern needs to be upgraded to UTF-8
    Otherwise does not return NULL.
 */
 
@@ -11736,16 +11787,16 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, U32 depth)
                        (bool) RExC_strict,
                        TRUE, /* Allow an optimized regnode result */
                        NULL);
-       if (*RExC_parse != ']') {
-           RExC_parse = oregcomp_parse;
-           vFAIL("Unmatched [");
-       }
         if (ret == NULL) {
-            if (*flagp & RESTART_UTF8)
+            if (*flagp & (RESTART_PASS1|NEED_UTF8))
                 return NULL;
             FAIL2("panic: regclass returned NULL to regatom, flags=%#"UVxf"",
                   (UV) *flagp);
         }
+       if (*RExC_parse != ']') {
+           RExC_parse = oregcomp_parse;
+           vFAIL("Unmatched [");
+       }
        nextchar(pRExC_state);
         Set_Node_Length(ret, RExC_parse - oregcomp_parse + 1); /* MJD */
        break;
@@ -11762,8 +11813,8 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
                    }
                    goto tryagain;
                }
-                if (flags & RESTART_UTF8) {
-                    *flagp = RESTART_UTF8;
+                if (flags & (RESTART_PASS1|NEED_UTF8)) {
+                    *flagp = flags & (RESTART_PASS1|NEED_UTF8);
                     return NULL;
                 }
                 FAIL2("panic: reg returned NULL to regatom, flags=%#"UVxf"",
@@ -11927,7 +11978,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
                         NOT_REACHED; /*NOTREACHED*/
                 }
                 RExC_parse = endbrace;
-                RExC_uni_semantics = 1;
+                REQUIRE_UNI_RULES(flagp, NULL);
 
                 if (PASS2 && op >= BOUNDA) {  /* /aa is same as /a */
                     OP(ret) = BOUNDU;
@@ -12039,8 +12090,10 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
                                (bool) RExC_strict,
                                TRUE, /* Allow an optimized regnode result */
                                NULL);
-                /* regclass() can only return RESTART_UTF8 if multi-char folds
-                   are allowed.  */
+                if (*flagp & RESTART_PASS1)
+                    return NULL;
+                /* regclass() can only return RESTART_PASS1 and NEED_UTF8 if
+                 * multi-char folds are allowed.  */
                 if (!ret)
                     FAIL2("panic: regclass returned NULL to regatom, 
flags=%#"UVxf"",
                           (UV) *flagp);
@@ -12077,7 +12130,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
                 break;
             }
 
-            if (*flagp & RESTART_UTF8)
+            if (*flagp & RESTART_PASS1)
                 return NULL;
             RExC_parse--;
             goto defchar;
@@ -12285,7 +12338,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
              * string's UTF8ness.  The reason to do this is that EXACTF is not
              * trie-able, EXACTFU is.
              *
-             * Similarly, we can convert EXACTFL nodes to EXACTFU if they
+             * Similarly, we can convert EXACTFL nodes to EXACTFLU8 if they
              * contain only above-Latin1 characters (hence must be in UTF8),
              * which don't participate in folds with Latin1-range characters,
              * as the latter's folds aren't known until runtime.  (We don't
@@ -12402,8 +12455,10 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
                                             flagp,
                                             depth)
                         ) {
-                            if (*flagp & RESTART_UTF8)
-                                FAIL("panic: grok_bslash_N set RESTART_UTF8");
+                            if (*flagp & NEED_UTF8)
+                                FAIL("panic: grok_bslash_N set NEED_UTF8");
+                            if (*flagp & RESTART_PASS1)
+                                return NULL;
 
                             /* Here, it wasn't a single code point.  Go close
                              * up this EXACTish node.  The switch() prior to
@@ -12413,7 +12468,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
                         }
                         p = RExC_parse;
                         if (ender > 0xff) {
-                            REQUIRE_UTF8;
+                            REQUIRE_UTF8(flagp);
                         }
                         break;
                    case 'r':
@@ -12460,7 +12515,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
                                goto recode_encoding;
                            }
                            if (ender > 0xff) {
-                               REQUIRE_UTF8;
+                               REQUIRE_UTF8(flagp);
                            }
                            break;
                        }
@@ -12498,7 +12553,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
                                 }
                            }
                             else {
-                               REQUIRE_UTF8;
+                               REQUIRE_UTF8(flagp);
                            }
                            break;
                        }
@@ -12543,7 +12598,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
                            STRLEN numlen = 3;
                            ender = grok_oct(p, &numlen, &flags, NULL);
                            if (ender > 0xff) {
-                               REQUIRE_UTF8;
+                               REQUIRE_UTF8(flagp);
                            }
                            p += numlen;
                             if (PASS2   /* like \08, \178 */
@@ -12565,7 +12620,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
                            ender = reg_recode((const char)(U8)ender, &enc);
                            if (!enc && PASS2)
                                ckWARNreg(p, "Invalid escape in the specified 
encoding");
-                           REQUIRE_UTF8;
+                           REQUIRE_UTF8(flagp);
                        }
                        break;
                    case '\0':
@@ -12705,15 +12760,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, U32 depth)
                         /* See if the character's fold differs between /d and
                          * /u.  This includes the multi-char fold SHARP S to
                          * 'ss' */
-                        if (maybe_exactfu
+                        if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+                            RExC_seen_unfolded_sharp_s = 1;
+                            maybe_exactfu = FALSE;
+                        }
+                        else if (maybe_exactfu
                             && (PL_fold[ender] != PL_fold_latin1[ender]
 #if    UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */   \
    || (UNICODE_MAJOR_VERSION == 3 && (   UNICODE_DOT_VERSION > 0)       \
                                       || UNICODE_DOT_DOT_VERSION > 0)
-                                || ender == LATIN_SMALL_LETTER_SHARP_S
-                                || (len > 0
-                                   && isALPHA_FOLD_EQ(ender, 's')
-                                   && isALPHA_FOLD_EQ(*(s-1), 's'))
+                                || (   len > 0
+                                    && isALPHA_FOLD_EQ(ender, 's')
+                                    && isALPHA_FOLD_EQ(*(s-1), 's'))
 #endif
                         )) {
                             maybe_exactfu = FALSE;
@@ -12724,7 +12782,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
                      * we have an array that finds its fold quickly */
                     *(s++) = (char) ender;
                 }
-                else {  /* FOLD and UTF */
+                else {  /* FOLD, and UTF (or sharp s) */
                     /* Unlike the non-fold case, we do actually have to
                      * calculate the results here in pass 1.  This is for two
                      * reasons, the folded length may be longer than the
@@ -13363,9 +13421,10 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, 
SV** return_invlist,
         set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET);
     }
 
-    RExC_uni_semantics = 1;     /* The use of this operator implies /u.  This
-                                   is required so that the compile time values
-                                   are valid in all runtime cases */
+    REQUIRE_UNI_RULES(flagp, NULL);   /* The use of this operator implies /u.
+                                         This is required so that the compile
+                                         time values are valid in all runtime
+                                         cases */
 
     /* This will return only an ANYOF regnode, or (unlikely) something smaller
      * (such as EXACT).  Thus we can skip most everything if just sizing.  We
@@ -13409,8 +13468,8 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, 
SV** return_invlist,
                         RExC_parse++;
                     }
 
-                    /* regclass() can only return RESTART_UTF8 if multi-char
-                       folds are allowed.  */
+                    /* regclass() can only return RESTART_PASS1 and NEED_UTF8
+                     * if multi-char folds are allowed.  */
                     if (!regclass(pRExC_state, flagp,depth+1,
                                   is_posix_class, /* parse the whole char
                                                      class only if not a
@@ -13669,8 +13728,8 @@ redo_curchar:
                 break;
 
             case '\\':
-                /* regclass() can only return RESTART_UTF8 if multi-char
-                   folds are allowed.  */
+                /* regclass() can only return RESTART_PASS1 and NEED_UTF8 if
+                 * multi-char folds are allowed.  */
                 if (!regclass(pRExC_state, flagp,depth+1,
                               TRUE, /* means parse just the next thing */
                               FALSE, /* don't allow multi-char folds */
@@ -13696,8 +13755,8 @@ redo_curchar:
                     RExC_parse++;
                 }
 
-                /* regclass() can only return RESTART_UTF8 if multi-char
-                   folds are allowed.  */
+                /* regclass() can only return RESTART_PASS1 and NEED_UTF8 if
+                 * multi-char folds are allowed.  */
                 if(!regclass(pRExC_state, flagp,depth+1,
                              is_posix_class, /* parse the whole char class
                                                 only if not a posix class */
@@ -13996,8 +14055,8 @@ redo_curchar:
      * already has all folding taken into consideration, and we don't want
      * regclass() to add to that */
     RExC_flags &= ~RXf_PMf_FOLD;
-    /* regclass() can only return RESTART_UTF8 if multi-char folds are allowed.
-     */
+    /* regclass() can only return RESTART_PASS1 and NEED_UTF8 if multi-char
+     * folds are allowed.  */
     node = regclass(pRExC_state, flagp,depth+1,
                     FALSE, /* means parse the whole char class */
                     FALSE, /* don't allow multi-char folds */
@@ -14218,8 +14277,9 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth,
      * are extra bits for \w, etc. in locale ANYOFs, as what these match is not
      * determinable at compile time
      *
-     * Returns NULL, setting *flagp to RESTART_UTF8 if the sizing scan needs
-     * to be restarted.  This can only happen if ret_invlist is non-NULL.
+     * Returns NULL, setting *flagp to RESTART_PASS1 if the sizing scan needs
+     * to be restarted, or'd with NEED_UTF8 if the pattern needs to be upgraded
+     * to UTF-8.  This can only happen if ret_invlist is non-NULL.
      */
 
     UV prevvalue = OOB_UNICODE, save_prevvalue = OOB_UNICODE;
@@ -14473,8 +14533,10 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, U32 depth,
                                         depth)
                     ) {
 
-                        if (*flagp & RESTART_UTF8)
-                            FAIL("panic: grok_bslash_N set RESTART_UTF8");
+                        if (*flagp & NEED_UTF8)
+                            FAIL("panic: grok_bslash_N set NEED_UTF8");
+                        if (*flagp & RESTART_PASS1)
+                            return NULL;
 
                         if (cp_count < 0) {
                             vFAIL("\\N in a character class must be a named 
character: \\N{...}");
@@ -14695,7 +14757,7 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth,
                                                 named */
 
                /* \p means they want Unicode semantics */
-               RExC_uni_semantics = 1;
+               REQUIRE_UNI_RULES(flagp, NULL);
                }
                break;
            case 'n':   value = '\n';                   break;
@@ -15081,7 +15143,7 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth,
        /* non-Latin1 code point implies unicode semantics.  Must be set in
         * pass1 so is there for the whole of pass 2 */
        if (value > 255) {
-           RExC_uni_semantics = 1;
+            REQUIRE_UNI_RULES(flagp, NULL);
        }
 
         /* Ready to process either the single value, or the completed range.
@@ -15363,7 +15425,7 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth,
 
        ret = reg(pRExC_state, 1, &reg_flags, depth+1);
 
-       *flagp |= reg_flags&(HASWIDTH|SIMPLE|SPSTART|POSTPONED|RESTART_UTF8);
+       *flagp |= 
reg_flags&(HASWIDTH|SIMPLE|SPSTART|POSTPONED|RESTART_PASS1|NEED_UTF8);
 
        RExC_parse = save_parse;
        RExC_end = save_end;
diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index 33647f3..e221ece 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -2411,6 +2411,27 @@ EOF
                             "Overlapping ranges in user-defined properties");
     }
 
+    { # [perl #125990], the final 2 tests below each caused a panic.
+        # The \0's are not necessary; it could be a printable character
+        # instead, but were in the ticket, so using them.
+        my $sharp_s = chr utf8::unicode_to_native(0xdf);
+        my $string        = ("\0" x 8)
+                          . ($sharp_s x 3)
+                          . ("\0" x 42)
+                          .  "ý";
+        my $folded_string = ("\0" x 8)
+                          . ("ss" x 3)
+                          . ("\0" x 42)
+                          .  "ý";
+        utf8::downgrade($string);
+        utf8::downgrade($folded_string);
+
+        like($string, qr/$string/i, "LATIN SMALL SHARP S matches itself under 
/id");
+        unlike($folded_string, qr/$string/i, "LATIN SMALL SHARP S doesn't 
match 'ss' under /di");
+        like($folded_string, qr/\N{}$string/i, "\\N{} earlier than LATIN SMALL 
SHARP S transforms /di into /ui, matches 'ss'");
+        like($folded_string, qr/$string\N{}/i, "\\N{} after LATIN SMALL SHARP 
S transforms /di into /ui, matches 'ss'");
+    }
+
     { # Regexp:Grammars was broken:
   # 
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2013-06/msg01290.html
         fresh_perl_like('use warnings; "abc" =~ qr{(?&foo){0}abc(?<foo>)}',
@@ -2428,6 +2449,7 @@ EOF
                      "buffer overflow in TRIE_STORE_REVCHAR");
     }
 
+
     # !!! NOTE that tests that aren't at all likely to crash perl should go
     # a ways above, above these last ones.
 

--
Perl5 Master Repository

Reply via email to