[perl.git] branch blead, updated. v5.21.4-272-g5920672

Karl Williamson Tue, 07 Oct 2014 07:52:09 -0700

In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/59206727b78ef2c3c1f5b83c0a4d9a24a867c5da?hp=8f5839a98a66edafefd2ffd3056e5e3fc64e1d27>


- Log -----------------------------------------------------------------
commit 59206727b78ef2c3c1f5b83c0a4d9a24a867c5da
Author: Karl Williamson <[email protected]>
Date:   Mon Oct 6 16:13:31 2014 -0600

    Fix EBCDIC-only bug with /[...]/
    
    Perl has special handling with ranges in bracketed character classes
    when the end points are either both uppercase or both lowercase.  This
    used a different test which is incorrect.  The consequences were that a
    some Latin1 ranges would be rendered as empty.

M       regcomp.c
M       t/re/pat.t

commit c66a6e35c0c9b556e732518ae78e397bc93ad7ac
Author: Karl Williamson <[email protected]>
Date:   Mon Oct 6 14:55:38 2014 -0600

    regcomp.c: Fix EBCDIC-only memory leak

M       regcomp.c

commit 7622251e573fa1b565b87088b4906fdbdfb0b4bd
Author: Karl Williamson <[email protected]>
Date:   Mon Oct 6 14:55:00 2014 -0600

    regcomp.c: Fix typo in comment

M       regcomp.c

commit 09e4339761388239d17da23bf3fa0c882a0b04bf
Author: Karl Williamson <[email protected]>
Date:   Mon Oct 6 12:14:36 2014 -0600

    Document special EBCDIC [...] literal range handling

M       pod/perldelta.pod
M       pod/perlrecharclass.pod

commit 423df6e4ea0fd95811eb041174e9e88a3e25975f
Author: Karl Williamson <[email protected]>
Date:   Mon Oct 6 12:02:25 2014 -0600

    \N{...} now treated as literal for EBCDIC qr/[...]/
    
    Perl specially handles some ranges in bracketed character classes to
    deal with gaps between, for example, "i" and "j".  The next commit will
    better document this.  This special handling is only invoked when both
    ends of the range are literals.  This commit extends that so that a
    \N{...} construct is considered a literal for this purpose.
    
    There really shouldn't be any difference in the handling of "A" vs
    "\N{LATIN CAPITAL LETTER A}".  And this commit makes them be handled the
    same.
    
    It further makes \N{U+0041} be treated the same.  The programmer here is
    effectively saying "A", and invoking the Unicode definition for it, so
    it stands to reason that the range s/he wants should also be interpreted
    the Unicode way.

M       regcomp.c
M       t/re/pat.t

commit 8498497fc66084a5698b2e8e44bd811ece344d84
Author: Karl Williamson <[email protected]>
Date:   Mon Oct 6 11:49:00 2014 -0600

    re/pat.t: Add some EBCDIC tests
    
    These test the special handling of ranges in bracketed character classes
    where the endpoints are alphanumeric literals, but there is a code point
    gap in the range that would otherwise match but shouldn't.

M       t/re/pat.t

commit 97651d61d67d90d9b6e1cbef032da8bee3f503cc
Author: Karl Williamson <[email protected]>
Date:   Sat Oct 4 11:13:45 2014 -0600

    toke.c: Extract into a named string constant
    
    This moves a literal double-quoted string into a const variable, so that
    it only need be written out once, and its  size can be compiler-computed
    and used as another const variable.  Hence, mnemonics are used instead
    of bare numbers.

M       toke.c

commit 85fba779aa33088791f9baea8a0627082b44470c
Author: Karl Williamson <[email protected]>
Date:   Sat Oct 4 10:09:46 2014 -0600

    toke.c: Update, clarify comments, fix some indents

M       toke.c

commit b0e1d434ae2102b015964eb25c3571ffc6c083b5
Author: Karl Williamson <[email protected]>
Date:   Thu Oct 2 22:46:31 2014 -0600

    reg_mesg.t: Add comments; remove trailing white-space

M       t/re/reg_mesg.t

commit 1a83413c438a835495931877ff113ba7f7ef57cc
Author: Karl Williamson <[email protected]>
Date:   Wed Oct 1 21:00:28 2014 -0600

    perlapi: Clarify two entries

M       handy.h
-----------------------------------------------------------------------

Summary of changes:
 handy.h                 |   9 +++--
 pod/perldelta.pod       |  17 ++++++--
 pod/perlrecharclass.pod |  35 ++++++++++++++---
 regcomp.c               |  13 ++++--
 t/re/pat.t              |  45 ++++++++++++++++++++-
 t/re/reg_mesg.t         |  16 +++++---
 toke.c                  | 102 ++++++++++++++++++++++++++++--------------------
 7 files changed, 173 insertions(+), 64 deletions(-)

diff --git a/handy.h b/handy.h
index 511bba3..5e0c86e 100644
--- a/handy.h
+++ b/handy.h
@@ -535,7 +535,9 @@ for.  If Perl can determine that the current locale is a 
UTF-8 locale, it uses
 the published Unicode rules; otherwise, it uses the C library function that
 gives the named classification.  For example, C<isDIGIT_LC()> when not in a
 UTF-8 locale returns the result of calling C<isdigit()>.  FALSE is always
-returned if the input won't fit into an octet.
+returned if the input won't fit into an octet.  On some platforms where the C
+library function is known to be defective, Perl changes its result to follow
+the POSIX standard's rules.
 
 Variant C<isFOO_LC_uvchr> is like C<isFOO_LC>, but is defined on any UV.  It
 returns the same as C<isFOO_LC> for input code points less than 256, and
@@ -704,8 +706,9 @@ word character includes more than the standard C language 
meaning of
 alphanumeric.
 See the L<top of this section|/Character classification> for an explanation of
 variants
-C<isWORDCHAR_A>, C<isWORDCHAR_L1>, C<isWORDCHAR_uni>, C<isWORDCHAR_utf8>,
-C<isWORDCHAR_LC>, C<isWORDCHAR_LC_uvchr>, and C<isWORDCHAR_LC_utf8>.
+C<isWORDCHAR_A>, C<isWORDCHAR_L1>, C<isWORDCHAR_uni>, and C<isWORDCHAR_utf8>.
+C<isWORDCHAR_LC>, C<isWORDCHAR_LC_uvchr>, and C<isWORDCHAR_LC_utf8> are also as
+described there, but additionally include the platform's native underscore.
 
 =for apidoc Am|bool|isXDIGIT|char ch
 Returns a boolean indicating whether the specified character is a hexadecimal
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index d4e0c1b..ac08859 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -205,7 +205,12 @@ section.
 
 =item *
 
-XXX Description of the change here
+Clarifications have been added to L<perlrecharclass/Character Ranges>
+to the effect that Perl guarantees that C<[A-Z]>, C<[a-z]>, C<[0-9]> and
+any subranges thereof in regular expression bracketed character classes
+are guaranteed to match exactly what a naive English speaker would
+expect them to match, even on platforms (such as EBCDIC) where special
+handling is required to accomplish this.
 
 =back
 
@@ -352,9 +357,15 @@ L</Modules and Pragmata> section.
 
 =over 4
 
-=item XXX-some-platform
+=item EBCDIC
 
-XXX
+Special handling is required on EBCDIC platforms to get C<qr/[i-j]/> to
+match only C<"i"> and C<"j">, since there are 7 characters between the
+code points for C<"i"> and C<"j">.  This special handling had only been
+invoked when both ends of the range are literals.  Now it is also
+invoked if any of the C<\N{...}> forms for specifying a character by
+name or Unicode code point is used instead of a literal.  See
+L<perlrecharclass/Character Ranges>.
 
 =back
 
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 3a38e56..4ab99ac 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -600,11 +600,6 @@ your set of characters to be matched and its position in 
the class is such
 that it could be considered part of a range, you must escape that hyphen
 with a backslash.
 
-The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense
-they always match exactly the 26 upper/lower case letters, regardless
-of the platform (this only effects EBCDIC, which would otherwise include 
-some non-letters).
-
 Examples:
 
  [a-z]       #  Matches a character that is a lower case ASCII letter.
@@ -616,6 +611,36 @@ Examples:
  ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
              #  (But not on an EBCDIC platform).
 
+Perl guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
+subranges of these match what an English-only speaker would expect them
+to match.  That is, C<[A-Z]> matches the 26 ASCII uppercase letters;
+C<[a-z]> matches the 26 lowercase letters; and C<[0-9]> matches the 10
+digits.  Subranges, like C<[h-k]>, match correspondingly, in this case
+just the four letters C<"h">, C<"i">, C<"j">, and C<"k">.  This is the
+natural behavior on ASCII platforms where the code points (ordinal
+values) for C<"h"> through C<"k"> are consecutive integers (0x68 through
+0x6B).  But special handling to achieve this may be needed on platforms
+with a non-ASCII native character set.  For example, on EBCDIC
+platforms, the code point for C<"h"> is 0x88, C<"i"> is 0x89, C<"j"> is
+0x91, and C<"k"> is 0x92.   Perl specially treats C<[h-k]> to exclude the
+seven code points in the gap: 0x8A through 0x90.  This special handling is
+only invoked when the range is a subrange of one of the ASCII uppercase,
+lowercase, and digit ranges, AND each end of the range is expressed
+either as a literal, like C<"A">, or as a named character (C<\N{...}>,
+including the C<\N{U+...> form).
+
+EBCDIC Examples:
+
+ [i-j]               #  Matches either "i" or "j"
+ [i-\N{LATIN SMALL LETTER J}]  # Same
+ [i-\N{U+6A}]        #  Same
+ [\N{U+69}-\N{U+6A}] #  Same
+ [\x{89}-\x{91}]     #  Matches 0x89 ("i"), 0x8A .. 0x90, 0x91 ("j")
+ [i-\x{91}]          #  Same
+ [\x{89}-j]          #  Same
+ [i-J]               #  Matches, 0x89 ("i") .. 0xC1 ("J"); special
+                     #  handling doesn't apply because range is mixed
+                     #  case
 
 =head3 Negation
 
diff --git a/regcomp.c b/regcomp.c
index ca32120..ebda789 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13813,6 +13813,10 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, U32 depth,
                         continue;   /* Back to top of loop to get next char */
                     }
                     /* Here, is a single code point, and <value> contains it */
+#ifdef EBCDIC
+                    /* We consider named characters to be literal characters */
+                    literal_endpoint++;
+#endif
                 }
                 break;
            case 'p':
@@ -14416,19 +14420,20 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 
*flagp, U32 depth,
              * included.  literal_endpoint==2 means both ends of the range used
              * a literal character, not \x{foo} */
            if (literal_endpoint == 2
-                && ((prevvalue >= 'a' && value <= 'z')
-                    || (prevvalue >= 'A' && value <= 'Z')))
+                && ((isLOWER_A(prevvalue) && isLOWER_A(value))
+                    || (isUPPER_A(prevvalue) && isUPPER_A(value))))
             {
                 _invlist_intersection(this_range, PL_XPosix_ptrs[_CC_ASCII],
                                       &this_range);
 
-                /* Since this above only contains ascii, the intersection of it
-                 * with anything will still yield only ascii */
+                /* Since 'this_range' now only contains ascii, the intersection
+                 * of it with anything will still yield only ascii */
                 _invlist_intersection(this_range, PL_XPosix_ptrs[_CC_ALPHA],
                                       &this_range);
             }
             _invlist_union(cp_foldable_list, this_range, &cp_foldable_list);
             literal_endpoint = 0;
+            SvREFCNT_dec_NN(this_range);
 #endif
         }
 
diff --git a/t/re/pat.t b/t/re/pat.t
index ac6bb3f..e532054 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -22,7 +22,7 @@ BEGIN {
     skip_all_without_unicode_tables();
 }
 
-plan tests => 730;  # Update this when adding/deleting tests.
+plan tests => 755;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -1485,6 +1485,44 @@ EOP
           'undefining *^R within (??{}) does not result in a crash';
     }
 
+    SKIP: {   # Test literal range end point special handling
+        unless ($::IS_EBCDIC) {
+            skip "Valid only for EBCDIC", 24;
+        }
+
+        like("\x89", qr/[i-j]/, '"\x89" should match [i-j]');
+        unlike("\x8A", qr/[i-j]/, '"\x8A" shouldnt match [i-j]');
+        unlike("\x90", qr/[i-j]/, '"\x90" shouldnt match [i-j]');
+        like("\x91", qr/[i-j]/, '"\x91" should match [i-j]');
+
+        like("\x89", qr/[i-\N{LATIN SMALL LETTER J}]/, '"\x89" should match 
[i-\N{LATIN SMALL LETTER J}]');
+        unlike("\x8A", qr/[i-\N{LATIN SMALL LETTER J}]/, '"\x8A" shouldnt 
match [i-\N{LATIN SMALL LETTER J}]');
+        unlike("\x90", qr/[i-\N{LATIN SMALL LETTER J}]/, '"\x90" shouldnt 
match [i-\N{LATIN SMALL LETTER J}]');
+        like("\x91", qr/[i-\N{LATIN SMALL LETTER J}]/, '"\x91" should match 
[i-\N{LATIN SMALL LETTER J}]');
+
+        like("\x89", qr/[i-\N{U+6A}]/, '"\x89" should match [i-\N{U+6A}]');
+        unlike("\x8A", qr/[i-\N{U+6A}]/, '"\x8A" shouldnt match [i-\N{U+6A}]');
+        unlike("\x90", qr/[i-\N{U+6A}]/, '"\x90" shouldnt match [i-\N{U+6A}]');
+        like("\x91", qr/[i-\N{U+6A}]/, '"\x91" should match [i-\N{U+6A}]');
+
+        like("\x89", qr/[\N{U+69}-\N{U+6A}]/, '"\x89" should match 
[\N{U+69}-\N{U+6A}]');
+        unlike("\x8A", qr/[\N{U+69}-\N{U+6A}]/, '"\x8A" shouldnt match 
[\N{U+69}-\N{U+6A}]');
+        unlike("\x90", qr/[\N{U+69}-\N{U+6A}]/, '"\x90" shouldnt match 
[\N{U+69}-\N{U+6A}]');
+        like("\x91", qr/[\N{U+69}-\N{U+6A}]/, '"\x91" should match 
[\N{U+69}-\N{U+6A}]');
+
+        like("\x89", qr/[i-\x{91}]/, '"\x89" should match [i-\x{91}]');
+        like("\x8A", qr/[i-\x{91}]/, '"\x8A" should match [i-\x{91}]');
+        like("\x90", qr/[i-\x{91}]/, '"\x90" should match [i-\x{91}]');
+        like("\x91", qr/[i-\x{91}]/, '"\x91" should match [i-\x{91}]');
+
+        # Need to use eval, because tries to compile on ASCII platforms even
+        # though the tests are skipped, and fails because 0x89-j is an illegal
+        # range there.
+        like("\x89", eval "qr/[\x{89}-j]/", '"\x89" should match [\x{89}-j]');
+        like("\x8A", eval "qr/[\x{89}-j]/", '"\x8A" should match [\x{89}-j]');
+        like("\x90", eval "qr/[\x{89}-j]/", '"\x90" should match [\x{89}-j]');
+        like("\x91", eval "qr/[\x{89}-j]/", '"\x91" should match [\x{89}-j]');
+    }
 
     # These are based on looking at the code in regcomp.c
     # We don't look for specific code, just the existence of an SSC
@@ -1577,6 +1615,11 @@ EOP
                   undef,
                  '\p{} should not fail silently when uni tables evanesce');
     }
+
+    {   # Special handling of literal-ended ranges in [...] was breaking this
+        use utf8;
+        like("Ã¿", qr/[Ã¿-Ã¿]/, "\"Ã¿\" should match [Ã¿-Ã¿]");
+    }
 } # End of sub run_tests
 
 1;
diff --git a/t/re/reg_mesg.t b/t/re/reg_mesg.t
index 5162aac..126a427 100644
--- a/t/re/reg_mesg.t
+++ b/t/re/reg_mesg.t
@@ -43,11 +43,14 @@ sub fixup_expect {
 }
 
 ## Because we don't "use utf8" in this file, we need to do some extra legwork
-## for the utf8 tests: Append 'use utf8' to the pattern, and mark the strings
-## to check against as UTF-8
+## for the utf8 tests: Prepend 'use utf8' to the pattern, and mark the strings
+## to check against as UTF-8, but for this all to work properly, the character
+## 'ã' (U+30CD) is required in each pattern somewhere as a marker.
 ##
 ## This also creates a second variant of the tests to check if the
-## latin1 error messages are working correctly.
+## latin1 error messages are working correctly.  Because we don't 'use utf8',
+## we can't tell if something is UTF-8 or Latin1, so you need the suffix
+## '; no latin1' to not have the second variant.
 my $l1   = "\x{ef}";
 my $utf8 = "\x{30cd}";
 utf8::encode($utf8);
@@ -58,7 +61,7 @@ sub mark_as_utf8 {
         my $l1_pat = $pat =~ s/$utf8/$l1/gr;
         my $l1_msg;
         $pat = "use utf8; $pat";
-        
+
         if (ref $msg) {
             $l1_msg = [ map { s/$utf8/$l1/gr } @$msg ];
             @$msg   = map { my $c = $_; utf8::decode($c); $c } @$msg;
@@ -68,6 +71,7 @@ sub mark_as_utf8 {
             utf8::decode($msg);
         }
         push @ret, $pat => $msg;
+
         push @ret, $l1_pat => $l1_msg unless $l1_pat =~ /#no latin1/;
     }
     return @ret;
@@ -236,6 +240,7 @@ my @death =
  '/((?# This is a comment in the middle of a token)*FAIL)/' => 'In 
\'(*VERB...)\', the \'(\' and \'*\' must be adjacent {#} m/((?# This is a 
comment in the middle of a token)*{#}FAIL)/',
 );
 
+# These need the character 'ã' as a marker for mark_as_utf8()
 my @death_utf8 = mark_as_utf8(
  '/ã[[=ã=]]ã/' => 'POSIX syntax [= =] is reserved for future extensions 
{#} m/ã[[=ã=]{#}]ã/',
  '/ã(?<= .*)/' =>  'Variable length lookbehind not implemented in regex 
m/ã(?<= .*)/',
@@ -404,6 +409,7 @@ my @warning = (
     '/b{3}  +\x{100}/x' => 'Useless use of greediness modifier \'+\' {#} 
m/b{3}  +{#}\x{100}/',
 ); # See comments before this for why '\x{100}' is generally needed
 
+# These need the character 'ã' as a marker for mark_as_utf8()
 my @warnings_utf8 = mark_as_utf8(
     'm/ã\b*ã/' => '\b* matches null string many times {#} m/ã\b*{#}ã/',
     '/(?=ã)*/' => '(?=ã)* matches null string many times {#} 
m/(?=ã)*{#}/',
@@ -415,7 +421,7 @@ my @warnings_utf8 = mark_as_utf8(
     '/ã[ã-[:digit:]]ã/' => 'False [] range "ã-[:digit:]" {#} 
m/ã[ã-[:digit:]{#}]ã/',
     '/ã[\d-\s]ã/' => 'False [] range "\d-" {#} m/ã[\d-{#}\s]ã/',
     '/ã[a\zb]ã/' => 'Unrecognized escape \z in character class passed 
through {#} m/ã[a\z{#}b]ã/',
-    '/ã(?c)ã/' => 'Useless (?c) - use /gc modifier {#} m/ã(?c{#})ã/',  
  
+    '/ã(?c)ã/' => 'Useless (?c) - use /gc modifier {#} m/ã(?c{#})ã/',
     '/utf8 ã (?ogc) ã/' => [
         'Useless (?o) - use /o modifier {#} m/utf8 ã (?o{#}gc) ã/',
         'Useless (?g) - use /g modifier {#} m/utf8 ã (?og{#}c) ã/',
diff --git a/toke.c b/toke.c
index 39e0b79..09ef5b2 100644
--- a/toke.c
+++ b/toke.c
@@ -3084,6 +3084,7 @@ S_scan_const(pTHX_ char *start)
             * symbol meaning, e.g. \x{2E} would be confused with a dot.  But
             * in spite of this, we do have to process \N here while the proper
             * charnames handler is in scope.  See bugs #56444 and #62056.
+             *
             * There is a complication because \N in a pattern may also stand
             * for 'match a non-nl', and not mean a charname, in which case its
             * processing should be deferred to the regex compiler.  To be a
@@ -3218,31 +3219,44 @@ S_scan_const(pTHX_ char *start)
                continue;
 
            case 'N':
-               /* In a non-pattern \N must be a named character, like \N{LATIN
-                * SMALL LETTER A} or \N{U+0041}.  For patterns, it also can
-                * mean to match a non-newline.  For non-patterns, named
-                * characters are converted to their string equivalents. In
-                * patterns, named characters are not converted to their
-                * ultimate forms for the same reasons that other escapes
-                * aren't.  Instead, they are converted to the \N{U+...} form
-                * to get the value from the charnames that is in effect right
-                * now, while preserving the fact that it was a named character
-                * so that the regex compiler knows this */
-
-               /* The structure of this section of code (besides checking for
+                /* In a non-pattern \N must be like \N{U+0041}, or it can be a
+                 * named character, like \N{LATIN SMALL LETTER A}, or a named
+                 * sequence, like \N{LATIN CAPITAL LETTER A WITH MACRON AND
+                 * GRAVE}.  For convenience all three forms are referred to as
+                 * "named characters" below.
+                 *
+                 * For patterns, \N also can mean to match a non-newline.  Code
+                 * before this 'switch' statement should already have handled
+                 * this situation, and hence this code only has to deal with
+                 * the named character cases.
+                 *
+                 * For non-patterns, the named characters are converted to
+                 * their string equivalents.  In patterns, named characters are
+                 * not converted to their ultimate forms for the same reasons
+                 * that other escapes aren't.  Instead, they are converted to
+                 * the \N{U+...} form to get the value from the charnames that
+                 * is in effect right now, while preserving the fact that it
+                 * was a named character, so that the regex compiler knows
+                 * this.
+                 *
+                * The structure of this section of code (besides checking for
                 * errors and upgrading to utf8) is:
-                *  Further disambiguate between the two meanings of \N, and if
-                *      not a charname, go process it elsewhere
-                *  If of form \N{U+...}, pass it through if a pattern;
-                *      otherwise convert to utf8
-                *  Otherwise must be \N{NAME}: convert to \N{U+c1.c2...} if a
-                *  pattern; otherwise convert to utf8 */
-
-               /* Here, s points to the 'N'; the test below is guaranteed to
-                * succeed if we are being called on a pattern as we already
-                * know from a test above that the next character is a '{'.
-                * On a non-pattern \N must mean 'named sequence, which
-                * requires braces */
+                 *  If the named character is of the form \N{U+...}, pass it
+                 *      through if a pattern; otherwise convert the code point
+                 *      to utf8
+                 *  Otherwise must be some \N{NAME}: convert to \N{U+c1.c2...}
+                 *      if a pattern; otherwise convert to utf8
+                 *
+                 * If the regex compiler should ever need to differentiate
+                 * between the \N{U+...} and \N{name} forms, that could easily
+                 * be done here by stripping any leading zeros from the
+                 * \N{U+...} case, and adding them to the other one. */
+
+                /* Here, 's' points to the 'N'; the test below is guaranteed to
+                * succeed if we are being called on a pattern, as we already
+                 * know from a test above that the next character is a '{'.  A
+                 * non-pattern \N must mean 'named character', which requires
+                 * braces */
                s++;
                if (*s != '{') {
                    yyerror("Missing braces on \\N{}"); 
@@ -3267,8 +3281,6 @@ S_scan_const(pTHX_ char *start)
                                | PERL_SCAN_DISALLOW_PREFIX;
                    STRLEN len;
 
-                   /* For \N{U+...}, the '...' is a unicode value even on
-                    * EBCDIC machines */
                    s += 2;         /* Skip to next char after the 'U+' */
                    len = e - s;
                    uv = grok_hex(s, &len, &flags, NULL);
@@ -3279,27 +3291,26 @@ S_scan_const(pTHX_ char *start)
                    }
 
                    if (PL_lex_inpat) {
-
-                       /* On non-EBCDIC platforms, pass through to the regex
-                        * compiler unchanged.  The reason we evaluated the
-                        * number above is to make sure there wasn't a syntax
-                        * error.  But on EBCDIC we convert to native so
-                        * downstream code can continue to assume it's native
-                        */
                        s -= 5;     /* Include the '\N{U+' */
 #ifdef EBCDIC
-                       d += my_snprintf(d, e - s + 1 + 1,  /* includes the }
+                        /* On EBCDIC platforms, in \N{U+...}, the '...' is a
+                         * Unicode value, so convert to native so downstream
+                         * code can continue to assume it's native */
+                       d += my_snprintf(d, e - s + 1 + 1,  /* includes the '}'
                                                               and the \0 */
-                                   "\\N{U+%X}",
-                                   (unsigned int) UNI_TO_NATIVE(uv));
+                                         "\\N{U+%X}",
+                                         (unsigned int) UNI_TO_NATIVE(uv));
 #else
-                       Copy(s, d, e - s + 1, char);    /* 1 = include the } */
+                        /* On non-EBCDIC platforms, pass it through unchanged.
+                         * The reason we evaluated the number above is to make
+                         * sure there wasn't a syntax error. */
+                       Copy(s, d, e - s + 1, char);    /* +1 is for the '}' */
                        d += e - s + 1;
 #endif
                    }
                    else {  /* Not a pattern: convert the hex to string */
 
-                        /* If destination is not in utf8, unconditionally
+                         /* If the destination is not in utf8, unconditionally
                          * recode it to be so.  This is because \N{} implies
                          * Unicode semantics, and scalars have to be in utf8
                          * to guarantee those semantics */
@@ -3352,13 +3363,18 @@ S_scan_const(pTHX_ char *start)
                                  * through the string.  Each character takes up
                                  * 2 hex digits plus either a trailing dot or
                                  * the "}" */
+                                const char initial_text[] = "\\N{U+";
+                                const STRLEN initial_len = sizeof(initial_text)
+                                                           - 1;
                                 d = off + SvGROW(sv, off
                                                     + 3 * len
-                                                    + 6 /* For the "\N{U+", and
-                                                           trailing NUL */
+
+                                                    /* +1 for trailing NUL */
+                                                    + initial_len + 1
+
                                                     + (STRLEN)(send - e));
-                                Copy("\\N{U+", d, 5, char);
-                                d += 5;
+                                Copy(initial_text, d, initial_len, char);
+                                d += initial_len;
                                 while (str < str_end) {
                                     char hex_string[4];
                                     int len =
@@ -3370,7 +3386,7 @@ S_scan_const(pTHX_ char *start)
                                     d += 3;
                                     str++;
                                 }
-                                d--;    /* We will overwrite below the final
+                                d--;    /* Below, we will overwrite the final
                                            dot with a right brace */
                             }
                             else {

--
Perl5 Master Repository

[perl.git] branch blead, updated. v5.21.4-272-g5920672

Reply via email to