[perl.git] branch blead, updated. v5.17.2-304-g97766bb

Karl Williamson Sat, 11 Aug 2012 18:33:26 -0700

In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/97766bb9a4aedb56c70924722c7fa2d0b2a744e4?hp=9ffe0f2ee36ba437b90034fb59b118abea257977>


- Log -----------------------------------------------------------------
commit 97766bb9a4aedb56c70924722c7fa2d0b2a744e4
Author: Karl Williamson <[email protected]>
Date:   Sat Aug 11 14:56:55 2012 -0600

    mktables: Rebuild if local Makefile has changed
    
    Normally, mktables is called from the Makefile at the base level.  But
    during development, it may manually be called from the directory (and
    hence that directory's Makefile).  This patch causes it to rebuild if
    that Makefile changes.

M       lib/unicore/mktables

commit 25e26d771831dbf46711953edf87195c78f38a10
Author: Karl Williamson <[email protected]>
Date:   Sat Aug 11 14:30:02 2012 -0600

    perlre: Nits
    
    This fixes some grammar ("either" legally should refer to only a
    dual-valued option set) and removes unnecessary distracting detail.

M       pod/perlre.pod

commit 7dff6b46615bc7810e4e66f957fe993b1c0a968d
Author: Karl Williamson <[email protected]>
Date:   Sat Aug 11 14:19:45 2012 -0600

    regcomp.c: Optimization not valid for Latin Sharp S
    
    The regex optimizer optimizes some quantifier expressions into simpler
    versions.  It turns out that these optimizations don't work on a
    quantified, folded LATIN SMALL LETTER SHARP S under /d.  This is due to
    the size differential of the fold from the source.
    
    This commit omits the optimization if this circumstance occurs anywhere
    in the regex prior to the determination of whether to optimize or not.
    I tried adding a parameter to study_chunk() to indicate more locally if
    the optimization should be excluded or not; but my first attempt did not
    fix the bug, and I chose to not pursue that line.  This character is so
    abnormal that it's probably best anyway to be overly cautious when
    confronted with it.

M       regcomp.c
M       t/re/re_tests

commit 7d103382d5bb85c8a8b1caa5ada8ea405c895a41
Author: Karl Williamson <[email protected]>
Date:   Sat Aug 11 14:10:05 2012 -0600

    regcomp.c: Extract duplicate code to common function
    
    Comments warned about keeping the two code sections in sync; this commit
    takes the portions that are identical and makes a common function out of
    them, so the synchronization becomes automatic.

M       regcomp.c

commit 2111a35c733ad351fd039ae6ebb7df6cb0a55c94
Author: Karl Williamson <[email protected]>
Date:   Fri Aug 10 12:16:45 2012 -0600

    regcomp.c: Make sure counter same in passes 1 and 2
    
    The number of elements was not being incremented in pass 1, whereas that
    number is needed later on in pass 1.  This did not cause a
    bug, as currently, in pass 1 we care only if the count is 1 or not, and
    this occurred only in a case where it would get incremented properly to
    more than 1 anyway.  But this is a potential bug that should be
    squelched before it happens.

M       regcomp.c

commit 6012a526f12c2cd0214843a2ed38eede87c28d58
Author: Karl Williamson <[email protected]>
Date:   Fri Aug 10 11:58:49 2012 -0600

    regcomp.c: Comments only
    
    The diffs will show more than this, as a block of comments was moved and
    revised

M       regcomp.c

commit 70ce1368b07e8958e7d4cb38d714ae744eed1bff
Author: Karl Williamson <[email protected]>
Date:   Fri Aug 10 11:53:20 2012 -0600

    regcomp.c: Use old paradigm in dealing with flags recursively
    
    In a recursive call to reg(), instead of passing our flags pointer, pass
    a new one and upon return or in that result with the existing one.  I
    can see why this should be done, as you don't want to lose what you
    already have, as reg() will start by resetting it to 0.  I don't know
    why one ands it with the known flags, but I'm presuming there is a
    reason, and so am copying the paradigm.  I searched the commit messages
    and didn't find anything.  No tests failed, and I didn't figure out a
    test that would fail.

M       regcomp.c

commit 0ccf8511670a0328f6f0142c36308afb3be61ba2
Author: Karl Williamson <[email protected]>
Date:   Fri Aug 10 09:11:11 2012 -0600

    regcomp.c: Create NOTHING node when would have been 0 length EXACT
    
    It's peculiar circumstances indeed that would get to this point in the
    code with an EXACT node to be created, but nothing to populate it with.
    Perhaps it is impossible; I'm not sure.  But commit
    5f820f894e71b6970a5aa0fd763a84b647fd628a changed the behavior, which I
    discovered in later re-reading the code.  Probably the node would be
    populated with a single NUL.  Just in case it is possible to get here
    under these peculiar circumstances, this commit adds code to handle the
    case, with a NOTHING node instead of a 0 length EXACT.

M       regcomp.c

commit 34b39fc9cd81fbff0d52451a5c4570293817ca32
Author: Karl Williamson <[email protected]>
Date:   Thu Aug 9 14:38:03 2012 -0600

    regcomp.c: Set flags when optimizing a [char class]
    
    A bracketed character class containing a single Latin1-range character
    has long been optimized into an EXACT node.  Also, flags are set to
    include SIMPLE.  However, EXACT nodes containing code points that are
    different when encoded under UTF-8 versus not UTF-8 should not be marked
    simple.
    
    To fix this, the address of the flags parameter is now passed to
    regclass(), the function that parses bracketed character classes, which
    now sets it appropriately.  The unconditional setting of SIMPLE that was
    always done in the code after calling regclass() has been removed.
    
    In addition, the setting of the flags for EXACT nodes has been pushed
    into the common function that populates them.
    
    regclass() will also now increment the naughtiness count if optimized to
    a node that normally does that.  I do not understand this heuristic
    behavior very well, and could not come up with a test case for it;
    experimentation revealed that there are no test cases in our test suite
    for which naughtiness makes any difference at all.

M       embed.fnc
M       embed.h
M       proto.h
M       regcomp.c
M       t/re/pat.t

commit 78412e20d4b28d3841c99754748aa4a28fe204d3
Author: Karl Williamson <[email protected]>
Date:   Tue Aug 7 21:06:06 2012 -0600

    regcomp.c: change pattern to utf8 if needed in \N{}
    
    This patch is in preparation for future patches that will no longer
    always make any pattern that contains \N{} be encoded in UTF-8.  Thus
    this patch doesn't actually change anything, but enables future ones.

M       regcomp.c

commit 8aa72352894d87fd71e434e16412ddf35fb57bda
Author: Karl Williamson <[email protected]>
Date:   Mon Aug 6 16:42:27 2012 -0600

    re/re_tests: Correct Todo test
    
    This test was not doing what it purported to test.  It should show that
    a /[s\xDF]/i would not match 'ss', because the 's' is seen in the class,
    and not the \xDF (which matches 'ss' under /i) in the appropriate
    strings

M       t/re/re_tests

commit 48fe68f5f0737af510b78c4c2219d284f18f4c3e
Author: Karl Williamson <[email protected]>
Date:   Sat Aug 4 11:02:16 2012 -0600

    re.pm: Nits in pod
    
    This has clarifications, grammar changes, and reflowing to fit into 79
    columns

M       ext/re/re.pm
M       t/porting/known_pod_issues.dat

commit 67048c1b9c69bd5dcb7e1bb7537dccc3ff3f16b1
Author: Karl Williamson <[email protected]>
Date:   Thu Aug 2 10:50:00 2012 -0600

    Add some tests for [\N{}]

M       t/re/pat_advanced.t
-----------------------------------------------------------------------

Summary of changes:
 embed.fnc                      |    8 +-
 embed.h                        |    4 +-
 ext/re/re.pm                   |   38 ++++---
 lib/unicore/mktables           |    6 +-
 pod/perlre.pod                 |   10 +-
 proto.h                        |   19 ++--
 regcomp.c                      |  250 ++++++++++++++++++++++------------------
 t/porting/known_pod_issues.dat |    3 +-
 t/re/pat.t                     |    9 ++-
 t/re/pat_advanced.t            |   10 ++
 t/re/re_tests                  |    4 +-
 11 files changed, 213 insertions(+), 148 deletions(-)

diff --git a/embed.fnc b/embed.fnc
index 2b25b3c..64bc381 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1930,14 +1930,15 @@ Es      |regnode*|regbranch     |NN struct RExC_state_t 
*pRExC_state \
                                |NN I32 *flagp|I32 first|U32 depth
 Es     |STRLEN |reguni         |NN const struct RExC_state_t *pRExC_state \
                                |UV uv|NN char *s
-Es     |regnode*|regclass      |NN struct RExC_state_t *pRExC_state|U32 depth
+Es     |regnode*|regclass      |NN struct RExC_state_t *pRExC_state \
+                               |NN I32 *flagp|U32 depth
 Es     |regnode*|reg_node      |NN struct RExC_state_t *pRExC_state|U8 op
 Es     |UV     |reg_recode     |const char value|NN SV **encp
 Es     |regnode*|regpiece      |NN struct RExC_state_t *pRExC_state \
                                |NN I32 *flagp|U32 depth
 Es     |bool   |grok_bslash_N  |NN struct RExC_state_t *pRExC_state \
                                |NULLOK regnode** nodep|NULLOK UV *valuep \
-                               |NULLOK I32 *flagp|U32 depth|bool in_char_class
+                               |NN I32 *flagp|U32 depth|bool in_char_class
 Es     |void   |reginsert      |NN struct RExC_state_t *pRExC_state \
                                |U8 op|NN regnode *opnd|U32 depth
 Es     |void   |regtail        |NN struct RExC_state_t *pRExC_state \
@@ -1951,7 +1952,8 @@ Es        |U32    |join_exact     |NN struct RExC_state_t 
*pRExC_state \
 EsRn   |char * |regwhite       |NN struct RExC_state_t *pRExC_state \
                                |NN char *p
 Ei     |void   |alloc_maybe_populate_EXACT|NN struct RExC_state_t *pRExC_state 
\
-                               |NN regnode *node|STRLEN len|UV code_point
+                               |NN regnode *node|NN I32 *flagp|STRLEN len \
+                               |UV code_point
 Ei     |U8   |compute_EXACTish|NN struct RExC_state_t *pRExC_state
 Es     |char * |nextchar       |NN struct RExC_state_t *pRExC_state
 Es     |bool   |reg_skipcomment|NN struct RExC_state_t *pRExC_state
diff --git a/embed.h b/embed.h
index fd03e8a..8c81ee9 100644
--- a/embed.h
+++ b/embed.h
@@ -908,7 +908,7 @@
 #define add_alternate(a,b,c)   S_add_alternate(aTHX_ a,b,c)
 #define add_cp_to_invlist(a,b) S_add_cp_to_invlist(aTHX_ a,b)
 #define add_data               S_add_data
-#define alloc_maybe_populate_EXACT(a,b,c,d)    
S_alloc_maybe_populate_EXACT(aTHX_ a,b,c,d)
+#define alloc_maybe_populate_EXACT(a,b,c,d,e)  
S_alloc_maybe_populate_EXACT(aTHX_ a,b,c,d,e)
 #define checkposixcc(a)                S_checkposixcc(aTHX_ a)
 #define cl_and                 S_cl_and
 #define cl_anything            S_cl_anything
@@ -943,7 +943,7 @@
 #define reganode(a,b,c)                S_reganode(aTHX_ a,b,c)
 #define regatom(a,b,c)         S_regatom(aTHX_ a,b,c)
 #define regbranch(a,b,c,d)     S_regbranch(aTHX_ a,b,c,d)
-#define regclass(a,b)          S_regclass(aTHX_ a,b)
+#define regclass(a,b,c)                S_regclass(aTHX_ a,b,c)
 #define reginsert(a,b,c,d)     S_reginsert(aTHX_ a,b,c,d)
 #define regpiece(a,b,c)                S_regpiece(aTHX_ a,b,c)
 #define regpposixcc(a,b)       S_regpposixcc(aTHX_ a,b)
diff --git a/ext/re/re.pm b/ext/re/re.pm
index 1242595..f770839 100644
--- a/ext/re/re.pm
+++ b/ext/re/re.pm
@@ -4,7 +4,7 @@ package re;
 use strict;
 use warnings;
 
-our $VERSION     = "0.21";
+our $VERSION     = "0.22";
 our @ISA         = qw(Exporter);
 our @EXPORT_OK   = ('regmust',
                     qw(is_regexp regexp_pattern
@@ -235,14 +235,16 @@ re - Perl pragma to alter regular expression behaviour
 
     $pat = '(?{ $foo = 1 })';
     use re 'eval';
-    /foo${pat}bar/;               # won't fail (when not under -T switch)
+    /foo${pat}bar/;               # won't fail (when not under -T
+                                   # switch)
 
     {
        no re 'taint';             # the default
        ($x) = ($^X =~ /^(.*)$/s); # $x is not tainted here
 
        no re 'eval';              # the default
-       /foo${pat}bar/;            # disallowed (with or without -T switch)
+       /foo${pat}bar/;            # disallowed (with or without -T
+                                   # switch)
     }
 
     use re '/ix';
@@ -251,22 +253,27 @@ re - Perl pragma to alter regular expression behaviour
     "FOO" =~ /foo/; # just /i implied
 
     use re 'debug';               # output debugging info during
-    /^(.*)$/s;                    #     compile and run time
+    /^(.*)$/s;                    # compile and run time
 
 
-    use re 'debugcolor';          # same as 'debug', but with colored output
+    use re 'debugcolor';          # same as 'debug', but with colored
+                                   # output
     ...
 
-    use re qw(Debug All);          # Finer tuned debugging options.
-    use re qw(Debug More);
-    no re qw(Debug ALL);           # Turn of all re debugging in this scope
+    use re qw(Debug All);          # Same as "use re 'debug'", but you
+                                   # can use "Debug" with things other
+                                   # than 'All'
+    use re qw(Debug More);         # 'All' plus output more details
+    no re qw(Debug ALL);           # Turn on (almost) all re debugging
+                                   # in this scope
 
     use re qw(is_regexp regexp_pattern); # import utility functions
     my ($pat,$mods)=regexp_pattern(qr/foo/i);
     if (is_regexp($obj)) { 
         print "Got regexp: ",
-            scalar regexp_pattern($obj); # just as perl would stringify it
-    }                                    # but no hassle with blessed re's.
+            scalar regexp_pattern($obj); # just as perl would stringify
+    }                                    # it but no hassle with blessed
+                                         # re's.
 
 (We use $^X in these examples because it's tainted by default.)
 
@@ -409,7 +416,7 @@ Extra debugging of how tries execute.
 
 =item INTUIT
 
-Enable debugging of start point optimisations.
+Enable debugging of start-point optimisations.
 
 =back
 
@@ -443,7 +450,7 @@ states as well. This output from this can be quite large.
 
 =item OPTIMISEM
 
-Enable enhanced optimisation debugging and start point optimisations.
+Enable enhanced optimisation debugging and start-point optimisations.
 Probably not useful except when debugging the regexp engine itself.
 
 =item OFFSETS
@@ -476,7 +483,8 @@ These are useful shortcuts to save on the typing.
 
 =item ALL
 
-Enable all options at once except OFFSETS, OFFSETSDBG and BUFFERS
+Enable all options at once except OFFSETS, OFFSETSDBG and BUFFERS.
+(To get every single option without exception, use both ALL and EXTRA.)
 
 =item All
 
@@ -488,14 +496,14 @@ Enable DUMP and all execute options. Equivalent to:
 
 =item More
 
-Enable TRIEM and all execute compile and execute options.
+Enable the options enabled by "All", plus STATE, TRIEC, and TRIEM.
 
 =back
 
 =back
 
 As of 5.9.5 the directive C<use re 'debug'> and its equivalents are
-lexically scoped, as the other directives are.  However they have both
+lexically scoped, as are the other directives.  However they have both
 compile-time and run-time effects.
 
 =head2 Exportable Functions
diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index 5b49f00..0be8b5b 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -17622,7 +17622,7 @@ END
 
 # Create the list of input files from the objects we have defined, plus
 # version
-my @input_files = 'version';
+my @input_files = qw(version Makefile);
 foreach my $object (@input_file_objects) {
     my $file = $object->file;
     next if ! defined $file;    # Not all objects have files
@@ -17653,6 +17653,10 @@ foreach my $in (@input_files) {
     }
 }
 
+# We use 'Makefile' just to see if it has changed since the last time we
+# rebuilt.  Now discard it.
+@input_files = grep { $_ ne 'Makefile' } @input_files;
+
 my $rebuild = $write_unchanged_files    # Rebuild: if unconditional rebuild
               || ! scalar @mktables_list_output_files  # or if no outputs known
               || $old_start_time < $most_recent;       # or out-of-date
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 6bb20e7..6a073ba 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -1521,11 +1521,11 @@ Conditional expression. Matches C<yes-pattern> if 
C<condition> yields
 a true value, matches C<no-pattern> otherwise. A missing pattern always
 matches.
 
-C<(condition)> should be either an integer in
+C<(condition)> should be one of: 1) an integer in
 parentheses (which is valid if the corresponding pair of parentheses
-matched), a look-ahead/look-behind/evaluate zero-width assertion, a
+matched); 2) a look-ahead/look-behind/evaluate zero-width assertion; 3) a
 name in angle brackets or single quotes (which is valid if a group
-with the given name matched), or the special symbol (R) (true when
+with the given name matched); or 4) the special symbol (R) (true when
 evaluated inside of recursion or eval). Additionally the R may be
 followed by a number, (which will be true when evaluated when recursing
 inside of the appropriate group), or by C<&NAME>, in which case it will
@@ -1923,11 +1923,11 @@ is the same as
 
 but
 
-  / ( A (*THEN) B | C (*THEN) D ) /
+  / ( A (*THEN) B | C ) /
 
 is not the same as
 
-  / ( A (*PRUNE) B | C (*PRUNE) D ) /
+  / ( A (*PRUNE) B | C ) /
 
 as after matching the A but failing on the B the C<(*THEN)> verb will
 backtrack and try C; but the C<(*PRUNE)> verb will simply fail.
diff --git a/proto.h b/proto.h
index 0a3e547..78fd691 100644
--- a/proto.h
+++ b/proto.h
@@ -6394,11 +6394,12 @@ STATIC U32      S_add_data(struct RExC_state_t 
*pRExC_state, U32 n, const char *s)
 #define PERL_ARGS_ASSERT_ADD_DATA      \
        assert(pRExC_state); assert(s)
 
-PERL_STATIC_INLINE void        S_alloc_maybe_populate_EXACT(pTHX_ struct 
RExC_state_t *pRExC_state, regnode *node, STRLEN len, UV code_point)
+PERL_STATIC_INLINE void        S_alloc_maybe_populate_EXACT(pTHX_ struct 
RExC_state_t *pRExC_state, regnode *node, I32 *flagp, STRLEN len, UV code_point)
                        __attribute__nonnull__(pTHX_1)
-                       __attribute__nonnull__(pTHX_2);
+                       __attribute__nonnull__(pTHX_2)
+                       __attribute__nonnull__(pTHX_3);
 #define PERL_ARGS_ASSERT_ALLOC_MAYBE_POPULATE_EXACT    \
-       assert(pRExC_state); assert(node)
+       assert(pRExC_state); assert(node); assert(flagp)
 
 STATIC void    S_checkposixcc(pTHX_ struct RExC_state_t *pRExC_state)
                        __attribute__nonnull__(pTHX_1);
@@ -6466,9 +6467,10 @@ PERL_STATIC_INLINE UV*   S_get_invlist_zero_addr(pTHX_ 
SV* invlist)
        assert(invlist)
 
 STATIC bool    S_grok_bslash_N(pTHX_ struct RExC_state_t *pRExC_state, 
regnode** nodep, UV *valuep, I32 *flagp, U32 depth, bool in_char_class)
-                       __attribute__nonnull__(pTHX_1);
+                       __attribute__nonnull__(pTHX_1)
+                       __attribute__nonnull__(pTHX_4);
 #define PERL_ARGS_ASSERT_GROK_BSLASH_N \
-       assert(pRExC_state)
+       assert(pRExC_state); assert(flagp)
 
 PERL_STATIC_INLINE UV* S_invlist_array(pTHX_ SV* const invlist)
                        __attribute__warn_unused_result__
@@ -6607,10 +6609,11 @@ STATIC regnode* S_regbranch(pTHX_ struct RExC_state_t 
*pRExC_state, I32 *flagp,
 #define PERL_ARGS_ASSERT_REGBRANCH     \
        assert(pRExC_state); assert(flagp)
 
-STATIC regnode*        S_regclass(pTHX_ struct RExC_state_t *pRExC_state, U32 
depth)
-                       __attribute__nonnull__(pTHX_1);
+STATIC regnode*        S_regclass(pTHX_ struct RExC_state_t *pRExC_state, I32 
*flagp, U32 depth)
+                       __attribute__nonnull__(pTHX_1)
+                       __attribute__nonnull__(pTHX_2);
 #define PERL_ARGS_ASSERT_REGCLASS      \
-       assert(pRExC_state)
+       assert(pRExC_state); assert(flagp)
 
 STATIC void    S_reginsert(pTHX_ struct RExC_state_t *pRExC_state, U8 op, 
regnode *opnd, U32 depth)
                        __attribute__nonnull__(pTHX_1)
diff --git a/regcomp.c b/regcomp.c
index 09a8f27..7e7d229 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -2661,7 +2661,7 @@ S_make_trie_failtable(pTHX_ RExC_state_t *pRExC_state, 
regnode *source,  regnode
  *      'ss' or not is not knowable at compile time.  It will match iff the
  *      target string is in UTF-8, unlike the EXACTFU nodes, where it always
  *      matches; and the EXACTFL and EXACTFA nodes where it never does.  Thus
- *      it can't be folded to "ss" at compile time, unlike EXACTFU does as
+ *      it can't be folded to "ss" at compile time, unlike EXACTFU does (as
  *      described in item 3).  An assumption that the optimizer part of
  *      regexec.c (probably unwittingly) makes is that a character in the
  *      pattern corresponds to at most a single character in the target string.
@@ -3663,7 +3663,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode 
**scanp,
                uc = utf8_to_uvchr_buf(s, s + l, NULL);
                l = utf8_length(s, s + l);
            }
-           else if (has_exactf_sharp_s) {
+           if (has_exactf_sharp_s) {
                RExC_seen |= REG_SEEN_EXACTF_SHARP_S;
            }
            min += l - min_subtract;
@@ -3960,6 +3960,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode 
**scanp,
                      && !(data->flags & SF_HAS_EVAL)
                      && !deltanext     /* atom is fixed width */
                      && minnext != 0   /* CURLYM can't handle zero width */
+                      && ! (RExC_seen & REG_SEEN_EXACTF_SHARP_S) /* Nor \xDF */
                ) {
                    /* XXXX How to optimize if data == 0? */
                    /* Optimize to a simpler form.  */
@@ -5206,6 +5207,50 @@ S_compile_runtime_code(pTHX_ RExC_state_t * const 
pRExC_state,
 }
 
 
+STATIC bool
+S_setup_longest(pTHX_ RExC_state_t *pRExC_state, SV* sv_longest, SV** rx_utf8, 
SV** rx_substr, I32* rx_end_shift, I32 lookbehind, I32 offset, I32 *minlen, 
STRLEN longest_length, bool eol, bool meol)
+{
+    /* This is the common code for setting up the floating and fixed length
+     * string data extracted from Perlre_op_compile() below.  Returns a boolean
+     * as to whether succeeded or not */
+
+    I32 t,ml;
+
+    if (! (longest_length
+           || (eol /* Can't have SEOL and MULTI */
+               && (! meol || (RExC_flags & RXf_PMf_MULTILINE)))
+          )
+            /* See comments for join_exact for why REG_SEEN_EXACTF_SHARP_S */
+        || (RExC_seen & REG_SEEN_EXACTF_SHARP_S))
+    {
+        return FALSE;
+    }
+
+    /* copy the information about the longest from the reg_scan_data
+        over to the program. */
+    if (SvUTF8(sv_longest)) {
+        *rx_utf8 = sv_longest;
+        *rx_substr = NULL;
+    } else {
+        *rx_substr = sv_longest;
+        *rx_utf8 = NULL;
+    }
+    /* end_shift is how many chars that must be matched that
+        follow this item. We calculate it ahead of time as once the
+        lookbehind offset is added in we lose the ability to correctly
+        calculate it.*/
+    ml = minlen ? *(minlen) : (I32)longest_length;
+    *rx_end_shift = ml - offset
+        - longest_length + (SvTAIL(sv_longest) != 0)
+        + lookbehind;
+
+    t = (eol/* Can't have SEOL and MULTI */
+         && (! meol || (RExC_flags & RXf_PMf_MULTILINE)));
+    fbm_compile(sv_longest, t ? FBMcf_TAIL : 0);
+
+    return TRUE;
+}
+
 /*
  * Perl_re_op_compile - the perl internal RE engine's function to compile a
  * regular expression into internal code.
@@ -6173,105 +6218,56 @@ reStudy:
        scan_commit(pRExC_state, &data,&minlen,0);
        SvREFCNT_dec(data.last_found);
 
-        /* Note that code very similar to this but for anchored string 
-           follows immediately below, changes may need to be made to both. 
-           Be careful. 
-         */
        longest_float_length = CHR_SVLEN(data.longest_float);
-       if (longest_float_length
-           || (data.flags & SF_FL_BEFORE_EOL
-               && (!(data.flags & SF_FL_BEFORE_MEOL)
-                   || (RExC_flags & RXf_PMf_MULTILINE)))) 
-        {
-            I32 t,ml;
 
-            /* See comments for join_exact for why REG_SEEN_EXACTF_SHARP_S */
-           if ((RExC_seen & REG_SEEN_EXACTF_SHARP_S)
-               || (SvCUR(data.longest_fixed)  /* ok to leave SvCUR */
-                   && data.offset_fixed == data.offset_float_min
-                   && SvCUR(data.longest_fixed) == SvCUR(data.longest_float)))
-                   goto remove_float;          /* As in (a)+. */
-
-            /* copy the information about the longest float from the 
reg_scan_data
-               over to the program. */
-           if (SvUTF8(data.longest_float)) {
-               r->float_utf8 = data.longest_float;
-               r->float_substr = NULL;
-           } else {
-               r->float_substr = data.longest_float;
-               r->float_utf8 = NULL;
-           }
-           /* float_end_shift is how many chars that must be matched that 
-              follow this item. We calculate it ahead of time as once the
-              lookbehind offset is added in we lose the ability to correctly
-              calculate it.*/
-           ml = data.minlen_float ? *(data.minlen_float) 
-                                  : (I32)longest_float_length;
-           r->float_end_shift = ml - data.offset_float_min
-               - longest_float_length + (SvTAIL(data.longest_float) != 0)
-               + data.lookbehind_float;
+        if (! ((SvCUR(data.longest_fixed)  /* ok to leave SvCUR */
+                   && data.offset_fixed == data.offset_float_min
+                   && SvCUR(data.longest_fixed) == SvCUR(data.longest_float)))
+            && S_setup_longest (aTHX_ pRExC_state,
+                                    data.longest_float,
+                                    &(r->float_utf8),
+                                    &(r->float_substr),
+                                    &(r->float_end_shift),
+                                    data.lookbehind_float,
+                                    data.offset_float_min,
+                                    data.minlen_float,
+                                    longest_float_length,
+                                    data.flags & SF_FL_BEFORE_EOL,
+                                    data.flags & SF_FL_BEFORE_MEOL))
+        {
            r->float_min_offset = data.offset_float_min - data.lookbehind_float;
            r->float_max_offset = data.offset_float_max;
            if (data.offset_float_max < I32_MAX) /* Don't offset infinity */
                r->float_max_offset -= data.lookbehind_float;
-           
-           t = (data.flags & SF_FL_BEFORE_EOL /* Can't have SEOL and MULTI */
-                      && (!(data.flags & SF_FL_BEFORE_MEOL)
-                          || (RExC_flags & RXf_PMf_MULTILINE)));
-           fbm_compile(data.longest_float, t ? FBMcf_TAIL : 0);
        }
        else {
-         remove_float:
            r->float_substr = r->float_utf8 = NULL;
            SvREFCNT_dec(data.longest_float);
            longest_float_length = 0;
        }
 
-        /* Note that code very similar to this but for floating string 
-           is immediately above, changes may need to be made to both. 
-           Be careful. 
-         */
        longest_fixed_length = CHR_SVLEN(data.longest_fixed);
 
-        /* See comments for join_exact for why REG_SEEN_EXACTF_SHARP_S */
-       if (! (RExC_seen & REG_SEEN_EXACTF_SHARP_S)
-           && (longest_fixed_length
-               || (data.flags & SF_FIX_BEFORE_EOL /* Cannot have SEOL and 
MULTI */
-                   && (!(data.flags & SF_FIX_BEFORE_MEOL)
-                       || (RExC_flags & RXf_PMf_MULTILINE)))) )
+        if (S_setup_longest (aTHX_ pRExC_state,
+                                data.longest_fixed,
+                                &(r->anchored_utf8),
+                                &(r->anchored_substr),
+                                &(r->anchored_end_shift),
+                                data.lookbehind_fixed,
+                                data.offset_fixed,
+                                data.minlen_fixed,
+                                longest_fixed_length,
+                                data.flags & SF_FIX_BEFORE_EOL,
+                                data.flags & SF_FIX_BEFORE_MEOL))
         {
-            I32 t,ml;
-
-            /* copy the information about the longest fixed 
-               from the reg_scan_data over to the program. */
-           if (SvUTF8(data.longest_fixed)) {
-               r->anchored_utf8 = data.longest_fixed;
-               r->anchored_substr = NULL;
-           } else {
-               r->anchored_substr = data.longest_fixed;
-               r->anchored_utf8 = NULL;
-           }
-           /* fixed_end_shift is how many chars that must be matched that 
-              follow this item. We calculate it ahead of time as once the
-              lookbehind offset is added in we lose the ability to correctly
-              calculate it.*/
-            ml = data.minlen_fixed ? *(data.minlen_fixed) 
-                                   : (I32)longest_fixed_length;
-            r->anchored_end_shift = ml - data.offset_fixed
-               - longest_fixed_length + (SvTAIL(data.longest_fixed) != 0)
-               + data.lookbehind_fixed;
            r->anchored_offset = data.offset_fixed - data.lookbehind_fixed;
-
-           t = (data.flags & SF_FIX_BEFORE_EOL /* Can't have SEOL and MULTI */
-                && (!(data.flags & SF_FIX_BEFORE_MEOL)
-                    || (RExC_flags & RXf_PMf_MULTILINE)));
-           fbm_compile(data.longest_fixed, t ? FBMcf_TAIL : 0);
        }
        else {
            r->anchored_substr = r->anchored_utf8 = NULL;
            SvREFCNT_dec(data.longest_fixed);
            longest_fixed_length = 0;
        }
+
        if (ri->regstclass
            && (OP(ri->regstclass) == REG_ANY || OP(ri->regstclass) == SANY))
            ri->regstclass = NULL;
@@ -9577,20 +9573,22 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
     return(ret);
 }
 
-/* grok_bslash_N(pRExC_state, regnode** node_p, UV *valuep, UV depth, bool 
in_charclass)
+STATIC bool
+S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state, regnode** node_p, UV *valuep, 
I32 *flagp, U32 depth, bool in_char_class)
+{
    
-   This is expected to be called by a parser routine that has recognized '\N'
+ /* This is expected to be called by a parser routine that has recognized '\N'
    and needs to handle the rest. RExC_parse is expected to point at the first
    char following the N at the time of the call.  On successful return,
    RExC_parse has been updated to point to just after the sequence identified
-   by this routine.
+   by this routine, and <*flagp> has been updated.
 
-   The \N may be inside (indicated by the boolean <in_charclass>) or outside a
+   The \N may be inside (indicated by the boolean <in_char_class>) or outside a
    character class.
 
    \N may begin either a named sequence, or if outside a character class, mean
    to match a non-newline.  For non single-quoted regexes, the tokenizer has
-   attempted to decide which, and in the case of a named sequence converted it
+   attempted to decide which, and in the case of a named sequence, converted it
    into one of the forms: \N{} (if the sequence is null), or \N{U+c1.c2...},
    where c1... are the characters in the sequence.  For single-quoted regexes,
    the tokenizer passes the \N sequence through unchanged; this code will not
@@ -9622,9 +9620,6 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, 
U32 depth)
    null.
  */
 
-STATIC bool
-S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state, regnode** node_p, UV *valuep, 
I32 *flagp, U32 depth, bool in_char_class)
-{
     char * endbrace;    /* '}' following the name */
     char* p;
     char *endchar;     /* Points to '.' or '}' ending cur char in the input
@@ -9774,6 +9769,7 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state, 
regnode** node_p, UV *valuep, I
        SV * substitute_parse = newSVpvn_flags("?:", 2, SVf_UTF8|SVs_TEMP);
        STRLEN len;
        char *orig_end = RExC_end;
+        I32 flags;
 
        while (RExC_parse < endbrace) {
 
@@ -9799,7 +9795,8 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state, 
regnode** node_p, UV *valuep, I
        /* The values are Unicode, and therefore not subject to recoding */
        RExC_override_recoding = 1;
 
-       *node_p = reg(pRExC_state, 1, flagp, depth+1);
+       *node_p = reg(pRExC_state, 1, &flags, depth+1);
+       *flagp |= flags&(HASWIDTH|SPSTART|SIMPLE|POSTPONED);
 
        RExC_parse = endbrace;
        RExC_end = orig_end;
@@ -9866,15 +9863,23 @@ S_compute_EXACTish(pTHX_ RExC_state_t *pRExC_state)
 }
 
 PERL_STATIC_INLINE void
-S_alloc_maybe_populate_EXACT(pTHX_ RExC_state_t *pRExC_state, regnode *node, 
STRLEN len, UV code_point)
+S_alloc_maybe_populate_EXACT(pTHX_ RExC_state_t *pRExC_state, regnode *node, 
I32* flagp, STRLEN len, UV code_point)
 {
-    /* This knows the details about sizing an EXACTish node, and potentially
-     * populating it with a single character.  If <len> is non-zero, it assumes
-     * that the node has already been populated, and just does the sizing,
-     * ignoring <code_point>.  Otherwise it looks at <code_point> and
-     * calculates what <len> should be.  In pass 1, it sizes the node
-     * appropriately.  In pass 2, it additionally will populate the node's
-     * STRING with <code_point>, if <len> is 0.
+    /* This knows the details about sizing an EXACTish node, setting flags for
+     * it (by setting <*flagp>, and potentially populating it with a single
+     * character.
+     *
+     * If <len> is non-zero, this function assumes that the node has already
+     * been populated, and just does the sizing.  In this case <code_point>
+     * should be the final code point that has already been placed into the
+     * node.  This value will be ignored except that under some circumstances
+     * <*flagp> is set based on it.
+     *
+     * If <len is zero, the function assumes that the node is to contain only
+     * the single character given by <code_point> and calculates what <len>
+     * should be.  In pass 1, it sizes the node appropriately.  In pass 2, it
+     * additionally will populate the node's STRING with <code_point>, if <len>
+     * is 0.  In both cases <*flagp> is appropriately set
      *
      * It knows that under FOLD, UTF characters and the Latin Sharp S must be
      * folded (the latter only when the rules indicate it can match 'ss') */
@@ -9919,6 +9924,10 @@ S_alloc_maybe_populate_EXACT(pTHX_ RExC_state_t 
*pRExC_state, regnode *node, STR
             Copy((char *) character, STRING(node), len, char);
         }
     }
+
+    *flagp |= HASWIDTH;
+    if (len == 1 && UNI_IS_INVARIANT(code_point))
+        *flagp |= SIMPLE;
 }
 
 /*
@@ -10033,13 +10042,12 @@ tryagain:
     case '[':
     {
        char * const oregcomp_parse = ++RExC_parse;
-        ret = regclass(pRExC_state,depth+1);
+        ret = regclass(pRExC_state, flagp,depth+1);
        if (*RExC_parse != ']') {
            RExC_parse = oregcomp_parse;
            vFAIL("Unmatched [");
        }
        nextchar(pRExC_state);
-       *flagp |= HASWIDTH|SIMPLE;
         Set_Node_Length(ret, RExC_parse - oregcomp_parse + 1); /* MJD */
        break;
     }
@@ -10250,7 +10258,7 @@ tryagain:
                }
                RExC_parse--;
 
-                ret = regclass(pRExC_state,depth+1);
+                ret = regclass(pRExC_state, flagp,depth+1);
 
                RExC_end = oldregxend;
                RExC_parse--;
@@ -10258,7 +10266,6 @@ tryagain:
                Set_Node_Offset(ret, parse_start + 2);
                Set_Node_Cur_Length(ret);
                nextchar(pRExC_state);
-               *flagp |= HASWIDTH|SIMPLE;
            }
            break;
         case 'N': 
@@ -10533,6 +10540,9 @@ tryagain:
                             goto loopdone;
                         }
                         p = RExC_parse;
+                        if (ender > 0xff) {
+                            REQUIRE_UTF8;
+                        }
                         break;
                    case 'r':
                        ender = '\r';
@@ -10932,6 +10942,17 @@ tryagain:
 
        loopdone:   /* Jumped to when encounters something that shouldn't be in
                       the node */
+
+            /* I (khw) don't know if you can get here with zero length, but the
+             * old code handled this situation by creating a zero-length EXACT
+             * node.  Might as well be NOTHING instead */
+            if (len == 0) {
+                OP(ret) = NOTHING;
+            }
+            else{
+                alloc_maybe_populate_EXACT(pRExC_state, ret, flagp, len, 
ender);
+            }
+
            RExC_parse = p - 1;
             Set_Node_Cur_Length(ret); /* MJD */
            nextchar(pRExC_state);
@@ -10941,12 +10962,7 @@ tryagain:
                if (iv < 0)
                    vFAIL("Internal disaster");
            }
-           if (len > 0)
-               *flagp |= HASWIDTH;
-           if (len == 1 && UNI_IS_INVARIANT(ender))
-               *flagp |= SIMPLE;
 
-            alloc_maybe_populate_EXACT(pRExC_state, ret, len, 0);
        } /* End of label 'defchar:' */
        break;
     } /* End of giant switch on input character */
@@ -11313,7 +11329,7 @@ S_add_alternate(pTHX_ AV** alternate_ptr, U8* string, 
STRLEN len)
    above 255, a range list is used */
 
 STATIC regnode *
-S_regclass(pTHX_ RExC_state_t *pRExC_state, U32 depth)
+S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 {
     dVAR;
     register UV nextvalue;
@@ -11484,7 +11500,7 @@ parseit:
                     if this makes sense as it does change the behaviour
                     from earlier versions, OTOH that behaviour was broken
                     as well. */
-                    if (! grok_bslash_N(pRExC_state, NULL, &value, NULL, depth,
+                    if (! grok_bslash_N(pRExC_state, NULL, &value, flagp, 
depth,
                                       TRUE /* => charclass */))
                     {
                         goto parseit;
@@ -11990,8 +12006,8 @@ parseit:
                    }
                     if (!SIZE_ONLY) {
                         cp_list = add_cp_to_invlist(cp_list, '-');
-                        element_count++;
                     }
+                    element_count++;
                } else
                    range = 1;  /* yeah, it's a range! */
                continue;       /* but do it the next time */
@@ -12103,6 +12119,7 @@ parseit:
                     if (invert) {
                         op += NALNUM - ALNUM;
                     }
+                    *flagp |= HASWIDTH|SIMPLE;
                     break;
 
                 /* The second group doesn't depend of the charset modifiers.
@@ -12113,6 +12130,7 @@ parseit:
                 case ANYOF_HORIZWS:
                   is_horizws:
                     op = (invert) ? NHORIZWS : HORIZWS;
+                    *flagp |= HASWIDTH|SIMPLE;
                     break;
 
                 case ANYOF_NVERTWS:
@@ -12120,6 +12138,7 @@ parseit:
                     /* FALLTHROUGH */
                 case ANYOF_VERTWS:
                     op = (invert) ? NVERTWS : VERTWS;
+                    *flagp |= HASWIDTH|SIMPLE;
                     break;
 
                 case ANYOF_MAX:
@@ -12159,6 +12178,8 @@ parseit:
             if (invert) {
                 if (! LOC && value == '\n') {
                     op = REG_ANY; /* Optimize [^\n] */
+                    *flagp |= HASWIDTH|SIMPLE;
+                    RExC_naughty++;
                 }
             }
             else if (value < 256 || UTF) {
@@ -12172,6 +12193,7 @@ parseit:
             if (prevvalue == '0') {
                 if (value == '9') {
                     op = (invert) ? NDIGITA : DIGITA;
+                    *flagp |= HASWIDTH|SIMPLE;
                 }
             }
         }
@@ -12205,9 +12227,10 @@ parseit:
                 if (! SIZE_ONLY) {
                     FLAGS(ret) = arg;
                 }
+                *flagp |= HASWIDTH|SIMPLE;
             }
             else if (PL_regkind[op] == EXACT) {
-                alloc_maybe_populate_EXACT(pRExC_state, ret, 0, value);
+                alloc_maybe_populate_EXACT(pRExC_state, ret, flagp, 0, value);
             }
 
             RExC_parse = (char *) cur_parse;
@@ -12219,7 +12242,7 @@ parseit:
 
     if (SIZE_ONLY)
         return ret;
-    /****** !SIZE_ONLY AFTER HERE *********/
+    /****** !SIZE_ONLY (Pass 2) AFTER HERE *********/
 
     /* If folding, we calculate all characters that could fold to or from the
      * ones already on the list */
@@ -12675,6 +12698,7 @@ parseit:
              * it doesn't match anything.  (perluniprops.pod notes such
              * properties) */
             op = OPFAIL;
+            *flagp |= HASWIDTH|SIMPLE;
         }
         else if (start == end) {    /* The range is a single code point */
             if (! invlist_iternext(cp_list, &start, &end)
@@ -12740,12 +12764,16 @@ parseit:
         else if (start == 0) {
             if (end == UV_MAX) {
                 op = SANY;
+                *flagp |= HASWIDTH|SIMPLE;
+                RExC_naughty++;
             }
             else if (end == '\n' - 1
                     && invlist_iternext(cp_list, &start, &end)
                     && start == '\n' + 1 && end == UV_MAX)
             {
                 op = REG_ANY;
+                *flagp |= HASWIDTH|SIMPLE;
+                RExC_naughty++;
             }
         }
 
@@ -12758,7 +12786,7 @@ parseit:
             RExC_parse = (char *)cur_parse;
 
             if (PL_regkind[op] == EXACT) {
-                alloc_maybe_populate_EXACT(pRExC_state, ret, 0, value);
+                alloc_maybe_populate_EXACT(pRExC_state, ret, flagp, 0, value);
             }
 
             SvREFCNT_dec(listsv);
@@ -12899,6 +12927,8 @@ parseit:
        RExC_rxi->data->data[n] = (void*)rv;
        ARG_SET(ret, n);
     }
+
+    *flagp |= HASWIDTH|SIMPLE;
     return ret;
 }
 #undef HAS_NONLOCALE_RUNTIME_PROPERTY_DEFINITION
diff --git a/t/porting/known_pod_issues.dat b/t/porting/known_pod_issues.dat
index 42a1bca..b1882bd 100644
--- a/t/porting/known_pod_issues.dat
+++ b/t/porting/known_pod_issues.dat
@@ -1,4 +1,4 @@
-# This file is the data file for t/porting/podcheck.t.
+# This file is the data file for porting/podcheck.t.
 # There are three types of lines.
 # Comment lines are white-space only or begin with a '#', like this one.  Any
 #   changes you make to the comment lines will be lost when the file is
@@ -197,7 +197,6 @@ ext/pod-html/testdir/perlvar-copy.pod       ? Should you be 
using L<...> instead of 3
 ext/pod-html/testdir/perlvar-copy.pod  Apparent broken link    2
 ext/pod-html/testdir/perlvar-copy.pod  Verbatim line length including indents 
exceeds 79 by    6
 ext/posix/lib/posix.pod        Verbatim line length including indents exceeds 
79 by    13
-ext/re/re.pm   Verbatim line length including indents exceeds 79 by    6
 ext/vms-dclsym/dclsym.pm       ? Should you be using L<...> instead of 1
 ext/vms-dclsym/dclsym.pm       Verbatim line length including indents exceeds 
79 by    1
 ext/vms-stdio/stdio.pm Verbatim line length including indents exceeds 79 by    
1
diff --git a/t/re/pat.t b/t/re/pat.t
index 3cfb363..08f784d 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -19,7 +19,7 @@ BEGIN {
     require './test.pl';
 }
 
-plan tests => 452;  # Update this when adding/deleting tests.
+plan tests => 453;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -1277,6 +1277,13 @@ EOP
        ok("51" =~ /$o/, "Qr_indirect bare");
     }
 
+    {   # Various flags weren't being set when a [] is optimized into an
+        # EXACTish node
+        ;
+        ;
+        ok("\x{017F}\x{017F}" =~ qr/^[\x{00DF}]?$/i, "[] to EXACTish 
optimization");
+    }
+
 } # End of sub run_tests
 
 1;
diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index 95f904f..771e441 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -985,6 +985,16 @@ sub run_tests {
                    "Zerolength charname in charclass doesn't match \\\\0"];
         ok $w && $w =~ /Ignoring zero length/,
                  'Ignoring zero length \N{} in character class warning';
+        undef $w;
+        eval q [ok 'xy' =~ /x[\N{EMPTY-STR} y]/x,
+                    'Empty string charname in [] is ignored; finds a following 
character'];
+        ok $w && $w =~ /Ignoring zero length/,
+                 'Ignoring zero length \N{} in character class warning';
+        undef $w;
+        eval q [ok 'x ' =~ /x[\N{EMPTY-STR} y]/,
+                    'Empty string charname in [] is ignored; finds a following 
blank under /x'];
+        ok $w && $w =~ /Ignoring zero length/,
+                 'Ignoring zero length \N{} in character class warning';
 
         ok 'AB'  =~ /(\N{EVIL})/ && $1 eq 'A', 'Charname caching $1';
         ok 'ABC' =~ /(\N{EVIL})/,              'Charname caching $1';
diff --git a/t/re/re_tests b/t/re/re_tests
index 3fdaf80..f44bdc1 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -1543,7 +1543,7 @@ a\97      a97     y       $&      a97
 /^\p{L}/       \x{3400}        y       $&      \x{3400}
 
 # RT #89774
-/[s\xDF]/ui    \xDFs   ybT     $&      \xDFs
+/[s\xDF]a/ui   ssa     ybT     $&      ssa
 
 # RT #99928
 /^\R\x0A$/     \x0D\x0A        n       -       -
@@ -1701,4 +1701,6 @@ ab[c\\\](??{"x"})]{3}d    ab\\](d y       -       -
 \w     \x{200D}        y       $&      \x{200D}
 \W     \x{200D}        n       -       -
 
+/^(?d:\xdf|_)*_/i      \x{17f}\x{17f}_ y       $&      \x{17f}\x{17f}_
+
 # vim: softtabstop=0 noexpandtab

--
Perl5 Master Repository

[perl.git] branch blead, updated. v5.17.2-304-g97766bb

Reply via email to