[perl.git] branch blead, updated. v5.13.10-129-gd50a4f9

Karl Williamson Sun, 27 Feb 2011 18:22:31 -0800

In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/d50a4f90cab527593b2dd218f71b66a6be555490?hp=2644ea75792e7142cc4124f2579ddfe278e562c0>


- Log -----------------------------------------------------------------
commit d50a4f90cab527593b2dd218f71b66a6be555490
Author: Karl Williamson <[email protected]>
Date:   Sun Feb 27 18:44:43 2011 -0700

    Handle [folds] of 0-255 without swashes
    
    Commit 56ca34cada940c7f6aae9a59da266e541530041e had the side effect of
    causing regular expressions with things like [a-z], or even just [k] to
    go out to disk to read tables to create swashes because it knew that
    some of those characters matched outside the bitmap (and due to
    l1_char_class_tab.h it knew which ones had those matches), but it didn't
    know what the characters were that participated in those folds.
    
    This patch hard-codes the Unicode 6.0 rules into regcomp.c for the
    code points 0-255, so that the very slow utf8_heavy is not invoked on
    them.  (Code points above 255 will continue to invoke it.)  It would,
    of course, be better if these rules could be regen'd into regcomp.c, as
    there is a risk that the standard will change, and the code will not.
    But I don't think that has ever happened; in other words, I think that
    the rules haven't changed so far since Day 1 of Unicode.  (That would
    not be the case if we were doing simple case folding, as the capital
    sharp ss which folds to U+00DF was added later.)  And the Standard is
    getting more stable in this area.  I believe one of their stability
    policies now forbid them from adding something that simply folds to
    one of the characters that already has a fold, such as M and m.
    Ligatures are frowned on, and I doubt that new ones would be encoded,
    so that leaves a new Unicode character that folds to a Latin-1 plus some
    sort of mark.  For those, this code is a no-op, so those aren't a
    problem either.

M       pod/perldiag.pod
M       regcomp.c

commit 2335b3d39eb70759d992779a5e8e11443648e5dd
Author: Karl Williamson <[email protected]>
Date:   Sun Feb 27 18:31:51 2011 -0700

    regcomp.c: Add deprecation macro with extra param

M       regcomp.c

commit 537429566e5149c6661a619b1f9a77e25ba30b8f
Author: Karl Williamson <[email protected]>
Date:   Sun Feb 27 15:19:04 2011 -0700

    regcomp.c: More prep for bitmap/nonbitmap folds
    
    This sets things up in preparation for a future commit that will
    move calculating all folds involving characters in the bit map.

M       regcomp.c

commit 5bfec14d8f541613f52ee87efb2cd875bad0cb37
Author: Karl Williamson <[email protected]>
Date:   Sun Feb 27 14:21:47 2011 -0700

    regcomp.c: Place marker for 2nd inversion list
    
    The set_regclass_bit functions will be adding to a new inversion list.
    This declares that list and passes it to them.

M       regcomp.c

commit 6c8241da60d5cd493ba55cff550f6579ba4a6cfd
Author: Karl Williamson <[email protected]>
Date:   Sun Feb 27 14:12:57 2011 -0700

    Change to use new add_cp_to_invlist()

M       regcomp.c

commit 2c6aa59398a9d39f981ea691c06f8b98ad60b54f
Author: Karl Williamson <[email protected]>
Date:   Sun Feb 27 14:04:26 2011 -0700

    regcomp.c: Add parameters to fcns
    
    A pointer to the list of multi-char folds in an ANYOF node is now passed
    to the routines that set the bit map.  This is in preparation for those
    routines to add to the list

M       embed.fnc
M       embed.h
M       proto.h
M       regcomp.c

commit 1d791ab2d32bd22ff8566e9d86204da48e96a040
Author: Karl Williamson <[email protected]>
Date:   Sun Feb 27 13:55:03 2011 -0700

    regcomp.c: Convert old-style to inversion list
    
    The code that handles a false range in a [character class] hadn't been
    converted to use inversion lists

M       regcomp.c

commit c229b64cfdf63c0671f41130fa3b2bb15dbf9779
Author: Karl Williamson <[email protected]>
Date:   Sun Feb 27 17:51:33 2011 -0700

    regcomp.c: Add fcn add_cp_to_invlist()
    
    This is just an inline shorthand when a single code point is all that is
    needed.  A macro could have been used instead, but this just seemed nicer.

M       embed.fnc
M       embed.h
M       proto.h
M       regcomp.c

commit c93d5d8be00caac56508448f35362b1b3fa58f02
Author: Karl Williamson <[email protected]>
Date:   Sun Feb 27 17:45:46 2011 -0700

    regcomp.c: Move code to common place
    
    THis is part of the refactoring of the code that sets the alternate array
    for multi-char folds.  Changing the node type to ANYOFV can be done at
    the last second, in pass 2, as it doesn't change any sizing, etc.

M       regcomp.c

commit c8453963c66d869ec2f19fde5d46805bcf04cf4f
Author: Karl Williamson <[email protected]>
Date:   Sun Feb 27 13:12:49 2011 -0700

    regcomp.c: Factor code into a function.
    
    A future commit uses this same code, so put it into a common place.

M       embed.fnc
M       embed.h
M       proto.h
M       regcomp.c

commit f508a60734257a13d3758f4d92f254fdb5ab797c
Author: Karl Williamson <[email protected]>
Date:   Sat Feb 26 22:02:26 2011 -0700

    Add #defines for 2 Latin1 chars
    
    These will be used in a future commit; the ordinals are different on
    EBCDIC vs. ASCII

M       utf8.h
M       utfebcdic.h

commit 78a0d3cc3d649167a7a58cbe7d5f494d42d154bd
Author: Karl Williamson <[email protected]>
Date:   Sat Feb 26 19:38:29 2011 -0700

    Move some #defines
    
    These were defined in a .c, but now there is need for them in another .c,
    so move them to a header.

M       pp.c
M       utf8.h

commit 18af16d0e66326ce358cf094d4b27aa663e7a10d
Author: Karl Williamson <[email protected]>
Date:   Sat Feb 26 10:21:09 2011 -0700

    regcomp.c: Remove no longer necessary tests
    
    A previous commit changed add_range_to_invlist() to do the creation
    that these lines did.

M       regcomp.c

commit c52a3e710d22e427503266cd12c740eaf81515ea
Author: Karl Williamson <[email protected]>
Date:   Sat Feb 26 10:16:20 2011 -0700

    regcomp.c: accept NULL as inversion list param
    
    Change the function add_range_to_invlist() to accept NULL as the
    inversion list, in which case it creates it.  A common usage of this
    function is to create the list if it doesn't exist before calling it, so
    this just makes that code once.

M       embed.fnc
M       proto.h
M       regcomp.c
-----------------------------------------------------------------------

Summary of changes:
 embed.fnc        |    8 +-
 embed.h          |    6 +-
 pod/perldiag.pod |    9 ++
 pp.c             |    6 -
 proto.h          |   30 ++++--
 regcomp.c        |  285 ++++++++++++++++++++++++++++++++++++++++++------------
 utf8.h           |    4 +
 utfebcdic.h      |    6 +
 8 files changed, 271 insertions(+), 83 deletions(-)

diff --git a/embed.fnc b/embed.fnc
index 8663b21..6f4f8fe 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -979,8 +979,9 @@ Ap  |SV*    |regclass_swash |NULLOK const regexp *prog \
                                |NN const struct regnode *node|bool doinit \
                                |NULLOK SV **listsvp|NULLOK SV **altsvp
 #ifdef PERL_IN_REGCOMP_C
-EMi    |U8     |set_regclass_bit|NN struct RExC_state_t* pRExC_state|NN 
regnode* node|const U8 value|NN HV** nonbitmap_ptr
-EMs    |U8     |set_regclass_bit_fold|NN struct RExC_state_t *pRExC_state|NN 
regnode* node|const U8 value|NN HV** nonbitmap_ptr
+EMi    |U8     |set_regclass_bit|NN struct RExC_state_t* pRExC_state|NN 
regnode* node|const U8 value|NN HV** invlist_ptr|NN AV** alternate_ptr
+EMs    |U8     |set_regclass_bit_fold|NN struct RExC_state_t *pRExC_state|NN 
regnode* node|const U8 value|NN HV** invlist_ptr|NN AV** alternate_ptr
+EMs    |void   |add_alternate  |NN AV** alternate_ptr|NN U8* string|STRLEN len
 #endif
 Ap     |I32    |pregexec       |NN REGEXP * const prog|NN char* stringarg \
                                |NN char* strend|NN char* strbeg|I32 minend \
@@ -1300,7 +1301,8 @@ EXMpR     |HV*    |_new_invlist   |IV initial_size
 EXMpR  |HV*    |_swash_to_invlist      |NN SV* const swash
 EXMp   |void   |_append_range_to_invlist   |NN HV* const invlist|const UV 
start|const UV end
 #ifdef PERL_IN_REGCOMP_C
-EsMR   |HV*    |add_range_to_invlist   |NN HV* const invlist|const UV 
start|const UV end
+EiMR   |HV*    |add_cp_to_invlist      |NULLOK HV* invlist|const UV cp
+EsMR   |HV*    |add_range_to_invlist   |NULLOK HV* invlist|const UV 
start|const UV end
 EiMR   |UV*    |invlist_array  |NN HV* const invlist
 EiM    |void   |invlist_destroy        |NN HV* const invlist
 EsM    |void   |invlist_extend    |NN HV* const invlist|const UV len
diff --git a/embed.h b/embed.h
index 727e921..573657e 100644
--- a/embed.h
+++ b/embed.h
@@ -866,6 +866,8 @@
 #define regcurly(a)            S_regcurly(aTHX_ a)
 #  endif
 #  if defined(PERL_IN_REGCOMP_C)
+#define add_alternate(a,b,c)   S_add_alternate(aTHX_ a,b,c)
+#define add_cp_to_invlist(a,b) S_add_cp_to_invlist(aTHX_ a,b)
 #define add_data               S_add_data
 #define add_range_to_invlist(a,b,c)    S_add_range_to_invlist(aTHX_ a,b,c)
 #define checkposixcc(a)                S_checkposixcc(aTHX_ a)
@@ -907,8 +909,8 @@
 #define reguni(a,b,c)          S_reguni(aTHX_ a,b,c)
 #define regwhite               S_regwhite
 #define scan_commit(a,b,c,d)   S_scan_commit(aTHX_ a,b,c,d)
-#define set_regclass_bit(a,b,c,d)      S_set_regclass_bit(aTHX_ a,b,c,d)
-#define set_regclass_bit_fold(a,b,c,d) S_set_regclass_bit_fold(aTHX_ a,b,c,d)
+#define set_regclass_bit(a,b,c,d,e)    S_set_regclass_bit(aTHX_ a,b,c,d,e)
+#define set_regclass_bit_fold(a,b,c,d,e)       S_set_regclass_bit_fold(aTHX_ 
a,b,c,d,e)
 #define study_chunk(a,b,c,d,e,f,g,h,i,j,k)     S_study_chunk(aTHX_ 
a,b,c,d,e,f,g,h,i,j,k)
 #  endif
 #  if defined(PERL_IN_REGEXEC_C)
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index aae2dd3..ce2a5d2 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -3607,6 +3607,15 @@ redirected it with select().)
 "Can't locate object method \"%s\" via package \"%s\"".  It often means
 that a method requires a package that has not been loaded.
 
+=item Perl folding rules are not up-to-date for 0x%x; please use the perlbug 
utility to report;
+
+(W regex, deprecated) You used a regular expression with
+case-insensitive matching, and there is a bug in Perl in which the
+built-in regular expression folding rules are not accurate.  This may
+lead to incorrect results.  Please report this as a bug using the
+"perlbug" utility.  (This message is marked deprecated, so that it by
+default will be turned-on.)
+
 =item Perl_my_%s() not available
 
 (F) Your platform has very uncommon byte-order and integer size,
diff --git a/pp.c b/pp.c
index d857c7e..d6f0332 100644
--- a/pp.c
+++ b/pp.c
@@ -3842,12 +3842,6 @@ PP(pp_crypt)
 /* Generally UTF-8 and UTF-EBCDIC are indistinguishable at this level.  So 
  * most comments below say UTF-8, when in fact they mean UTF-EBCDIC as well */
 
-/* Both the characters below can be stored in two UTF-8 bytes.  In UTF-8 the 
max
- * character that 2 bytes can hold is U+07FF, and in UTF-EBCDIC it is U+03FF.
- * See http://www.unicode.org/unicode/reports/tr16 */
-#define LATIN_CAPITAL_LETTER_Y_WITH_DIAERESIS 0x0178   /* Also is title case */
-#define GREEK_CAPITAL_LETTER_MU 0x039C /* Upper and title case of MICRON */
-
 /* Below are several macros that generate code */
 /* Generates code to store a unicode codepoint c that is known to occupy
  * exactly two UTF-8 and UTF-EBCDIC bytes; it is stored into p and p+1. */
diff --git a/proto.h b/proto.h
index d4642aa..b915977 100644
--- a/proto.h
+++ b/proto.h
@@ -5954,6 +5954,15 @@ STATIC SV *      S_space_join_names_mortal(pTHX_ char 
*const *array)
 
 #endif
 #if defined(PERL_IN_REGCOMP_C)
+STATIC void    S_add_alternate(pTHX_ AV** alternate_ptr, U8* string, STRLEN 
len)
+                       __attribute__nonnull__(pTHX_1)
+                       __attribute__nonnull__(pTHX_2);
+#define PERL_ARGS_ASSERT_ADD_ALTERNATE \
+       assert(alternate_ptr); assert(string)
+
+PERL_STATIC_INLINE HV* S_add_cp_to_invlist(pTHX_ HV* invlist, const UV cp)
+                       __attribute__warn_unused_result__;
+
 STATIC U32     S_add_data(struct RExC_state_t *pRExC_state, U32 n, const char 
*s)
                        __attribute__warn_unused_result__
                        __attribute__nonnull__(1)
@@ -5961,11 +5970,8 @@ STATIC U32       S_add_data(struct RExC_state_t 
*pRExC_state, U32 n, const char *s)
 #define PERL_ARGS_ASSERT_ADD_DATA      \
        assert(pRExC_state); assert(s)
 
-STATIC HV*     S_add_range_to_invlist(pTHX_ HV* const invlist, const UV start, 
const UV end)
-                       __attribute__warn_unused_result__
-                       __attribute__nonnull__(pTHX_1);
-#define PERL_ARGS_ASSERT_ADD_RANGE_TO_INVLIST  \
-       assert(invlist)
+STATIC HV*     S_add_range_to_invlist(pTHX_ HV* invlist, const UV start, const 
UV end)
+                       __attribute__warn_unused_result__;
 
 STATIC void    S_checkposixcc(pTHX_ struct RExC_state_t *pRExC_state)
                        __attribute__nonnull__(pTHX_1);
@@ -6204,19 +6210,21 @@ STATIC void     S_scan_commit(pTHX_ const struct 
RExC_state_t *pRExC_state, struct s
 #define PERL_ARGS_ASSERT_SCAN_COMMIT   \
        assert(pRExC_state); assert(data); assert(minlenp)
 
-PERL_STATIC_INLINE U8  S_set_regclass_bit(pTHX_ struct RExC_state_t* 
pRExC_state, regnode* node, const U8 value, HV** nonbitmap_ptr)
+PERL_STATIC_INLINE U8  S_set_regclass_bit(pTHX_ struct RExC_state_t* 
pRExC_state, regnode* node, const U8 value, HV** invlist_ptr, AV** 
alternate_ptr)
                        __attribute__nonnull__(pTHX_1)
                        __attribute__nonnull__(pTHX_2)
-                       __attribute__nonnull__(pTHX_4);
+                       __attribute__nonnull__(pTHX_4)
+                       __attribute__nonnull__(pTHX_5);
 #define PERL_ARGS_ASSERT_SET_REGCLASS_BIT      \
-       assert(pRExC_state); assert(node); assert(nonbitmap_ptr)
+       assert(pRExC_state); assert(node); assert(invlist_ptr); 
assert(alternate_ptr)
 
-STATIC U8      S_set_regclass_bit_fold(pTHX_ struct RExC_state_t *pRExC_state, 
regnode* node, const U8 value, HV** nonbitmap_ptr)
+STATIC U8      S_set_regclass_bit_fold(pTHX_ struct RExC_state_t *pRExC_state, 
regnode* node, const U8 value, HV** invlist_ptr, AV** alternate_ptr)
                        __attribute__nonnull__(pTHX_1)
                        __attribute__nonnull__(pTHX_2)
-                       __attribute__nonnull__(pTHX_4);
+                       __attribute__nonnull__(pTHX_4)
+                       __attribute__nonnull__(pTHX_5);
 #define PERL_ARGS_ASSERT_SET_REGCLASS_BIT_FOLD \
-       assert(pRExC_state); assert(node); assert(nonbitmap_ptr)
+       assert(pRExC_state); assert(node); assert(invlist_ptr); 
assert(alternate_ptr)
 
 STATIC I32     S_study_chunk(pTHX_ struct RExC_state_t *pRExC_state, regnode 
**scanp, I32 *minlenp, I32 *deltap, regnode *last, struct scan_data_t *data, 
I32 stopparen, U8* recursed, struct regnode_charc ... [44 chars truncated]
                        __attribute__nonnull__(pTHX_1)
diff --git a/regcomp.c b/regcomp.c
index 932da1a..76579a0 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -509,6 +509,13 @@ static const scan_data_t zero_scan_data =
            (int)offset, RExC_precomp, RExC_precomp + offset);          \
 } STMT_END
 
+#define        ckWARN2regdep(loc,m, a1) STMT_START {                           
\
+    const IV offset = loc - RExC_precomp;                              \
+    Perl_ck_warner_d(aTHX_ packWARN2(WARN_DEPRECATED, WARN_REGEXP),    \
+           m REPORT_LOCATION,                                          \
+           a1, (int)offset, RExC_precomp, RExC_precomp + offset);      \
+} STMT_END
+
 #define        ckWARN2reg(loc, m, a1) STMT_START {                             
\
     const IV offset = loc - RExC_precomp;                              \
     Perl_ck_warner(aTHX_ packWARN(WARN_REGEXP), m REPORT_LOCATION,     \
@@ -6249,18 +6256,25 @@ S_invlist_intersection(pTHX_ HV* const a, HV* const b)
 }
 
 STATIC HV*
-S_add_range_to_invlist(pTHX_ HV* const invlist, const UV start, const UV end)
+S_add_range_to_invlist(pTHX_ HV* invlist, const UV start, const UV end)
 {
     /* Add the range from 'start' to 'end' inclusive to the inversion list's
      * set.  A pointer to the inversion list is returned.  This may actually be
-     * a new list, in which case the passed in one has been destroyed */
+     * a new list, in which case the passed in one has been destroyed.  The
+     * passed in inversion list can be NULL, in which case a new one is created
+     * with just the one range in it */
 
     HV* range_invlist;
     HV* added_invlist;
+    UV len;
 
-    UV len = invlist_len(invlist);
-
-    PERL_ARGS_ASSERT_ADD_RANGE_TO_INVLIST;
+    if (invlist == NULL) {
+       invlist = _new_invlist(2);
+       len = 0;
+    }
+    else {
+       len = invlist_len(invlist);
+    }
 
     /* If comes after the final entry, can just append it to the end */
     if (len == 0
@@ -6287,6 +6301,11 @@ S_add_range_to_invlist(pTHX_ HV* const invlist, const UV 
start, const UV end)
     return added_invlist;
 }
 
+PERL_STATIC_INLINE HV*
+S_add_cp_to_invlist(pTHX_ HV* invlist, const UV cp) {
+    return add_range_to_invlist(invlist, cp, cp);
+}
+
 /* End of inversion list object */
 
 /*
@@ -9111,14 +9130,14 @@ S_checkposixcc(pTHX_ RExC_state_t *pRExC_state)
 ANYOF_##NAME:                                                                  
\
        for (value = 0; value < 256; value++)                                  \
            if (TEST)                                                          \
-           stored += set_regclass_bit(pRExC_state, ret, (U8) value, 
&nonbitmap);  \
+           stored += set_regclass_bit(pRExC_state, ret, (U8) value, 
&l1_fold_invlist, &unicode_alternate);  \
     yesno = '+';                                                               
\
     what = WORD;                                                               
\
     break;                                                                     
\
 case ANYOF_N##NAME:                                                            
\
        for (value = 0; value < 256; value++)                                  \
            if (!TEST)                                                         \
-           stored += set_regclass_bit(pRExC_state, ret, (U8) value, 
&nonbitmap);  \
+           stored += set_regclass_bit(pRExC_state, ret, (U8) value, 
&l1_fold_invlist, &unicode_alternate);  \
     yesno = '!';                                                               
\
     what = WORD;                                                               
\
     break
@@ -9133,14 +9152,14 @@ ANYOF_##NAME:                                           
                       \
     else if (UNI_SEMANTICS) {                                                  
\
         for (value = 0; value < 256; value++) {                                
\
             if (TEST_8(value)) stored +=                                       
\
-                      set_regclass_bit(pRExC_state, ret, (U8) value, 
&nonbitmap);  \
+                      set_regclass_bit(pRExC_state, ret, (U8) value, 
&l1_fold_invlist, &unicode_alternate);  \
         }                                                                      
\
     }                                                                          
\
     else {                                                                     
\
         for (value = 0; value < 128; value++) {                                
\
             if (TEST_7(UNI_TO_NATIVE(value))) stored +=                        
\
                set_regclass_bit(pRExC_state, ret,                     \
-                                  (U8) UNI_TO_NATIVE(value), &nonbitmap);      
           \
+                                  (U8) UNI_TO_NATIVE(value), &l1_fold_invlist, 
&unicode_alternate);                 \
         }                                                                      
\
     }                                                                          
\
     yesno = '+';                                                               
\
@@ -9151,18 +9170,18 @@ case ANYOF_N##NAME:                                     
                       \
     else if (UNI_SEMANTICS) {                                                  
\
         for (value = 0; value < 256; value++) {                                
\
             if (! TEST_8(value)) stored +=                                     
\
-                   set_regclass_bit(pRExC_state, ret, (U8) value, &nonbitmap); 
   \
+                   set_regclass_bit(pRExC_state, ret, (U8) value, 
&l1_fold_invlist, &unicode_alternate);    \
         }                                                                      
\
     }                                                                          
\
     else {                                                                     
\
         for (value = 0; value < 128; value++) {                                
\
             if (! TEST_7(UNI_TO_NATIVE(value))) stored += set_regclass_bit(  \
-                       pRExC_state, ret, (U8) UNI_TO_NATIVE(value), 
&nonbitmap);    \
+                       pRExC_state, ret, (U8) UNI_TO_NATIVE(value), 
&l1_fold_invlist, &unicode_alternate);    \
         }                                                                      
\
        if (AT_LEAST_ASCII_RESTRICTED) {                                       \
            for (value = 128; value < 256; value++) {                          \
              stored += set_regclass_bit(                                     \
-                          pRExC_state, ret, (U8) UNI_TO_NATIVE(value), 
&nonbitmap); \
+                          pRExC_state, ret, (U8) UNI_TO_NATIVE(value), 
&l1_fold_invlist, &unicode_alternate); \
            }                                                                  \
            ANYOF_FLAGS(ret) |= ANYOF_UNICODE_ALL;                             \
        }                                                                      \
@@ -9196,19 +9215,22 @@ case ANYOF_N##NAME:                                     
                       \
 #endif
 
 STATIC U8
-S_set_regclass_bit_fold(pTHX_ RExC_state_t *pRExC_state, regnode* node, const 
U8 value, HV** nonbitmap_ptr)
+S_set_regclass_bit_fold(pTHX_ RExC_state_t *pRExC_state, regnode* node, const 
U8 value, HV** invlist_ptr, AV** alternate_ptr)
 {
 
     /* Handle the setting of folds in the bitmap for non-locale ANYOF nodes.
      * Locale folding is done at run-time, so this function should not be
      * called for nodes that are for locales.
      *
-     * This function simply sets the bit corresponding to the fold of the input
+     * This function sets the bit corresponding to the fold of the input
      * 'value', if not already set.  The fold of 'f' is 'F', and the fold of
      * 'F' is 'f'.
      *
-     * It also sets any necessary flags, and returns the number of bits that
-     * actually changed from 0 to 1 */
+     * It also knows about the characters that are in the bitmap that have
+     * folds that are matchable only outside it, and sets the appropriate lists
+     * and flags.
+     *
+     * It returns the number of bits that actually changed from 0 to 1 */
 
     U8 stored = 0;
     U8 fold;
@@ -9223,20 +9245,111 @@ S_set_regclass_bit_fold(pTHX_ RExC_state_t 
*pRExC_state, regnode* node, const U8
         ANYOF_BITMAP_SET(node, fold);
         stored++;
     }
-    if 
((_HAS_NONLATIN1_FOLD_CLOSURE_ONLY_FOR_USE_BY_REGCOMP_DOT_C_AND_REGEXEC_DOT_C(value)
 && (! isASCII(value) || ! MORE_ASCII_RESTRICTED))
-       || (! UNI_SEMANTICS
-           && ! isASCII(value)
-           && PL_fold_latin1[value] != value))
-    {   /* A character that has a fold outside of Latin1 matches outside the
-           bitmap, but only when the target string is utf8.  Similarly when we
-           don't have unicode semantics for the above ASCII Latin-1 characters,
-           and they have a fold, they should match if the target is utf8, and
-          not otherwise.  We add the character here, and calculate the fold
-          later, with the other nonbitmap folds */
-       if (! *nonbitmap_ptr) {
-           *nonbitmap_ptr = _new_invlist(2);
+    if 
(_HAS_NONLATIN1_FOLD_CLOSURE_ONLY_FOR_USE_BY_REGCOMP_DOT_C_AND_REGEXEC_DOT_C(value)
 && (! isASCII(value) || ! MORE_ASCII_RESTRICTED)) {
+       /* Certain Latin1 characters have matches outside the bitmap.  To get
+        * here, 'value' is one of those characters.   None of these matches is
+        * valid for ASCII characters under /aa, which have been excluded by
+        * the 'if' above.  The matches fall into three categories:
+        * 1) They are singly folded-to or -from an above 255 character, as
+        *    LATIN SMALL LETTER Y WITH DIAERESIS and LATIN CAPITAL LETTER Y
+        *    WITH DIAERESIS;
+        * 2) They are part of a multi-char fold with another character in the
+        *    bitmap, only LATIN SMALL LETTER SHARP S => "ss" fits that bill;
+        * 3) They are part of a multi-char fold with a character not in the
+        *    bitmap, such as various ligatures.
+        * We aren't dealing fully with multi-char folds, except we do deal
+        * with the pattern containing a character that has a multi-char fold
+        * (not so much the inverse).
+        * For types 1) and 3), the matches only happen when the target string
+        * is utf8; that's not true for 2), and we set a flag for it.
+        *
+        * The code below adds to the passed in inversion list the single fold
+        * closures for 'value'.  The values are hard-coded here so that an
+        * innocent-looking character class, like /[ks]/i won't have to go out
+        * to disk to find the possible matches.  XXX It would be better to
+        * generate these via regen, in case a new version of the Unicode
+        * standard adds new mappings, though that is not really likely. */
+       switch (value) {
+           case 'k':
+           case 'K':
+               /* KELVIN SIGN */
+               *invlist_ptr = add_cp_to_invlist(*invlist_ptr, 0x212A);
+               break;
+           case 's':
+           case 'S':
+               /* LATIN SMALL LETTER LONG S */
+               *invlist_ptr = add_cp_to_invlist(*invlist_ptr, 0x017F);
+               break;
+           case MICRO_SIGN:
+               *invlist_ptr = add_cp_to_invlist(*invlist_ptr,
+                                                GREEK_SMALL_LETTER_MU);
+               *invlist_ptr = add_cp_to_invlist(*invlist_ptr,
+                                                GREEK_CAPITAL_LETTER_MU);
+               break;
+           case LATIN_CAPITAL_LETTER_A_WITH_RING_ABOVE:
+           case LATIN_SMALL_LETTER_A_WITH_RING_ABOVE:
+               /* ANGSTROM SIGN */
+               *invlist_ptr = add_cp_to_invlist(*invlist_ptr, 0x212B);
+               if (DEPENDS_SEMANTICS) {    /* See DEPENDS comment below */
+                   *invlist_ptr = add_cp_to_invlist(*invlist_ptr,
+                                                    PL_fold_latin1[value]);
+               }
+               break;
+           case LATIN_SMALL_LETTER_Y_WITH_DIAERESIS:
+               *invlist_ptr = add_cp_to_invlist(*invlist_ptr,
+                                       LATIN_CAPITAL_LETTER_Y_WITH_DIAERESIS);
+               break;
+           case LATIN_SMALL_LETTER_SHARP_S:
+
+               /* Under /d and /u, this can match the two chars "ss" */
+               if (! MORE_ASCII_RESTRICTED) {
+                   add_alternate(alternate_ptr, (U8 *) "ss", 2);
+
+                   /* And under /u, it can match even if the target is not
+                    * utf8 */
+                   if (UNI_SEMANTICS) {
+                       ANYOF_FLAGS(node) |= ANYOF_NONBITMAP_NON_UTF8;
+                   }
+               }
+               break;
+           case 'F': case 'f':
+           case 'I': case 'i':
+           case 'L': case 'l':
+           case 'T': case 't':
+               /* These all are targets of multi-character folds, which can
+                * occur with only non-Latin1 characters in the fold, so they
+                * can match if the target string isn't UTF-8 */
+               ANYOF_FLAGS(node) |= ANYOF_NONBITMAP_NON_UTF8;
+               break;
+           case 'A': case 'a':
+           case 'H': case 'h':
+           case 'J': case 'j':
+           case 'N': case 'n':
+           case 'W': case 'w':
+           case 'Y': case 'y':
+               /* These all are targets of multi-character folds, which occur
+                * only with a non-Latin1 character as part of the fold, so
+                * they can't match unless the target string is in UTF-8, so no
+                * action here is necessary */
+               break;
+           default:
+               /* Use deprecated warning to increase the chances of this
+                * being output */
+               ckWARN2regdep(RExC_parse, "Perl folding rules are not 
up-to-date for 0x%x; please use the perlbug utility to report;", value);
+               break;
        }
-       *nonbitmap_ptr = add_range_to_invlist(*nonbitmap_ptr, value, value);
+    }
+    else if (DEPENDS_SEMANTICS
+           && ! isASCII(value)
+           && PL_fold_latin1[value] != value)
+    {
+          /* Under DEPENDS rules, non-ASCII Latin1 characters match their
+           * folds only when the target string is in UTF-8.  We add the fold
+           * here to the list of things to match outside the bitmap, which
+           * won't be looked at unless it is UTF8 (or else if something else
+           * says to look even if not utf8, but those things better not happen
+           * under DEPENDS semantics. */
+       *invlist_ptr = add_cp_to_invlist(*invlist_ptr, PL_fold_latin1[value]);
     }
 
     return stored;
@@ -9244,7 +9357,7 @@ S_set_regclass_bit_fold(pTHX_ RExC_state_t *pRExC_state, 
regnode* node, const U8
 
 
 PERL_STATIC_INLINE U8
-S_set_regclass_bit(pTHX_ RExC_state_t *pRExC_state, regnode* node, const U8 
value, HV** nonbitmap_ptr)
+S_set_regclass_bit(pTHX_ RExC_state_t *pRExC_state, regnode* node, const U8 
value, HV** invlist_ptr, AV** alternate_ptr)
 {
     /* This inline function sets a bit in the bitmap if not already set, and if
      * appropriate, its fold, returning the number of bits that actually
@@ -9262,12 +9375,30 @@ S_set_regclass_bit(pTHX_ RExC_state_t *pRExC_state, 
regnode* node, const U8 valu
     stored = 1;
 
     if (FOLD && ! LOC) {       /* Locale folds aren't known until runtime */
-       stored += set_regclass_bit_fold(pRExC_state, node, value, 
nonbitmap_ptr);
+       stored += set_regclass_bit_fold(pRExC_state, node, value, invlist_ptr, 
alternate_ptr);
     }
 
     return stored;
 }
 
+STATIC void
+S_add_alternate(pTHX_ AV** alternate_ptr, U8* string, STRLEN len)
+{
+    /* Adds input 'string' with length 'len' to the ANYOF node's unicode
+     * alternate list, pointed to by 'alternate_ptr'.  This is an array of
+     * the multi-character folds of characters in the node */
+    SV *sv;
+
+    PERL_ARGS_ASSERT_ADD_ALTERNATE;
+
+    if (! *alternate_ptr) {
+       *alternate_ptr = newAV();
+    }
+    sv = newSVpvn_utf8((char*)string, len, TRUE);
+    av_push(*alternate_ptr, sv);
+    return;
+}
+
 /*
    parse a class specification and produce either an ANYOF node that
    matches the pattern or perhaps will be optimized into an EXACTish node
@@ -9292,7 +9423,27 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, U32 depth)
     STRLEN initial_listsv_len = 0; /* Kind of a kludge to see if it is more
                                      than just initialized.  */
     UV n;
+
+    /* code points this node matches that can't be stored in the bitmap */
     HV* nonbitmap = NULL;
+
+    /* The items that are to match that aren't stored in the bitmap, but are a
+     * result of things that are stored there.  This is the fold closure of
+     * such a character, either because it has DEPENDS semantics and shouldn't
+     * be matched unless the target string is utf8, or is a code point that is
+     * too large for the bit map, as for example, the fold of the MICRO SIGN is
+     * above 255.  This all is solely for performance reasons.  By having this
+     * code know the outside-the-bitmap folds that the bitmapped characters are
+     * involved with, we don't have to go out to disk to find the list of
+     * matches, unless the character class includes code points that aren't
+     * storable in the bit map.  That means that a character class with an 's'
+     * in it, for example, doesn't need to go out to disk to find everything
+     * that matches.  A 2nd list is used so that the 'nonbitmap' list is kept
+     * empty unless there is something whose fold we don't know about, and will
+     * have to go out to the disk to find. */
+    HV* l1_fold_invlist = NULL;
+
+    /* List of multi-character folds that are matched by this node */
     AV* unicode_alternate  = NULL;
 #ifdef EBCDIC
     UV literal_endpoint = 0;
@@ -9584,7 +9735,8 @@ parseit:
            }
 
            /* a bad range like a-\d, a-[:digit:].  The '-' is taken as a
-            * literal */
+            * literal, as is the character that began the false range, i.e.
+            * the 'a' in the examples */
            if (range) {
                if (!SIZE_ONLY) {
                    const int w =
@@ -9594,15 +9746,14 @@ parseit:
                               "False [] range \"%*.*s\"",
                               w, w, rangebegin);
 
+                   stored +=
+                         set_regclass_bit(pRExC_state, ret, '-', 
&l1_fold_invlist, &unicode_alternate);
                    if (prevvalue < 256) {
                        stored +=
-                         set_regclass_bit(pRExC_state, ret, (U8) prevvalue, 
&nonbitmap);
-                       stored +=
-                         set_regclass_bit(pRExC_state, ret, '-', &nonbitmap);
+                         set_regclass_bit(pRExC_state, ret, (U8) prevvalue, 
&l1_fold_invlist, &unicode_alternate);
                    }
                    else {
-                       Perl_sv_catpvf(aTHX_ listsv,
-                          "%04"UVxf"\n%04"UVxf"\n", (UV)prevvalue, (UV) '-');
+                       nonbitmap = add_cp_to_invlist(nonbitmap, prevvalue);
                    }
                }
 
@@ -9649,7 +9800,7 @@ parseit:
                    else {
                        for (value = 0; value < 128; value++)
                            stored +=
-                              set_regclass_bit(pRExC_state, ret, (U8) 
ASCII_TO_NATIVE(value), &nonbitmap);
+                              set_regclass_bit(pRExC_state, ret, (U8) 
ASCII_TO_NATIVE(value), &l1_fold_invlist, &unicode_alternate);
                    }
                    yesno = '+';
                    what = NULL;        /* Doesn't match outside ascii, so
@@ -9661,7 +9812,7 @@ parseit:
                    else {
                        for (value = 128; value < 256; value++)
                            stored +=
-                              set_regclass_bit(pRExC_state, ret, (U8) 
ASCII_TO_NATIVE(value), &nonbitmap);
+                              set_regclass_bit(pRExC_state, ret, (U8) 
ASCII_TO_NATIVE(value), &l1_fold_invlist, &unicode_alternate);
                    }
                    ANYOF_FLAGS(ret) |= ANYOF_UNICODE_ALL;
                    yesno = '!';
@@ -9674,7 +9825,7 @@ parseit:
                        /* consecutive digits assumed */
                        for (value = '0'; value <= '9'; value++)
                            stored +=
-                              set_regclass_bit(pRExC_state, ret, (U8) value, 
&nonbitmap);
+                              set_regclass_bit(pRExC_state, ret, (U8) value, 
&l1_fold_invlist, &unicode_alternate);
                    }
                    yesno = '+';
                    what = POSIX_CC_UNI_NAME("Digit");
@@ -9686,10 +9837,10 @@ parseit:
                        /* consecutive digits assumed */
                        for (value = 0; value < '0'; value++)
                            stored +=
-                              set_regclass_bit(pRExC_state, ret, (U8) value, 
&nonbitmap);
+                              set_regclass_bit(pRExC_state, ret, (U8) value, 
&l1_fold_invlist, &unicode_alternate);
                        for (value = '9' + 1; value < 256; value++)
                            stored +=
-                              set_regclass_bit(pRExC_state, ret, (U8) value, 
&nonbitmap);
+                              set_regclass_bit(pRExC_state, ret, (U8) value, 
&l1_fold_invlist, &unicode_alternate);
                    }
                    yesno = '!';
                    what = POSIX_CC_UNI_NAME("Digit");
@@ -9738,7 +9889,7 @@ parseit:
                    }
                    if (!SIZE_ONLY)
                        stored +=
-                            set_regclass_bit(pRExC_state, ret, '-', 
&nonbitmap);
+                            set_regclass_bit(pRExC_state, ret, '-', 
&l1_fold_invlist, &unicode_alternate);
                } else
                    range = 1;  /* yeah, it's a range! */
                continue;       /* but do it the next time */
@@ -9767,28 +9918,25 @@ parseit:
                        for (i = prevvalue; i <= ceilvalue; i++)
                            if (isLOWER(i) && !ANYOF_BITMAP_TEST(ret,i)) {
                                stored +=
-                                  set_regclass_bit(pRExC_state, ret, (U8) i, 
&nonbitmap);
+                                  set_regclass_bit(pRExC_state, ret, (U8) i, 
&l1_fold_invlist, &unicode_alternate);
                            }
                    } else {
                        for (i = prevvalue; i <= ceilvalue; i++)
                            if (isUPPER(i) && !ANYOF_BITMAP_TEST(ret,i)) {
                                stored +=
-                                  set_regclass_bit(pRExC_state, ret, (U8) i, 
&nonbitmap);
+                                  set_regclass_bit(pRExC_state, ret, (U8) i, 
&l1_fold_invlist, &unicode_alternate);
                            }
                    }
                }
                else
 #endif
                      for (i = prevvalue; i <= ceilvalue; i++) {
-                       stored += set_regclass_bit(pRExC_state, ret, (U8) i, 
&nonbitmap);
+                       stored += set_regclass_bit(pRExC_state, ret, (U8) i, 
&l1_fold_invlist, &unicode_alternate);
                      }
          }
          if (value > 255) {
            const UV prevnatvalue  = NATIVE_TO_UNI(prevvalue);
            const UV natvalue      = NATIVE_TO_UNI(value);
-           if (! nonbitmap) {
-               nonbitmap = _new_invlist(2);
-           }
            nonbitmap = add_range_to_invlist(nonbitmap, prevnatvalue, natvalue);
        }
 #if 0
@@ -9904,6 +10052,15 @@ parseit:
     /****** !SIZE_ONLY AFTER HERE *********/
 
     /* Finish up the non-bitmap entries */
+    if (l1_fold_invlist) {
+       if (nonbitmap) {
+           nonbitmap = invlist_union(nonbitmap, l1_fold_invlist);
+       }
+       else {
+           nonbitmap = l1_fold_invlist;
+       }
+        l1_fold_invlist = NULL;
+    }
     if (nonbitmap) {
        UV i;
 
@@ -9983,7 +10140,6 @@ parseit:
                         * these multicharacter foldings, to be later saved as
                         * part of the additional "s" data. */
                        if (! RExC_in_lookbehind) {
-                           SV *sv;
                            U8* loc = foldbuf;
                            U8* e = foldbuf + foldlen;
 
@@ -10018,14 +10174,7 @@ parseit:
                                }
                            }
 
-                           if (!unicode_alternate) {
-                               unicode_alternate = newAV();
-                           }
-                           sv = newSVpvn_utf8((char*)foldbuf, foldlen, TRUE);
-                           av_push(unicode_alternate, sv);
-
-                           /* This node is variable length */
-                           OP(ret) = ANYOFV;
+                           add_alternate(&unicode_alternate, foldbuf, foldlen);
                        end_multi_fold: ;
                        }
                    }
@@ -10061,7 +10210,7 @@ parseit:
                                }
 
                                if (c < 256 && AT_LEAST_UNI_SEMANTICS) {
-                                   stored += set_regclass_bit(pRExC_state, 
ret, (U8) c, &nonbitmap);
+                                   stored += set_regclass_bit(pRExC_state, 
ret, (U8) c, &l1_fold_invlist, &unicode_alternate);
                                }
                                    /* It may be that the code point is already
                                     * in this range or already in the bitmap,
@@ -10070,7 +10219,7 @@ parseit:
                                         && (c > 255
                                             || ! ANYOF_BITMAP_TEST(ret, c)))
                                {
-                                   nonbitmap = add_range_to_invlist(nonbitmap, 
c, c);
+                                   nonbitmap = add_cp_to_invlist(nonbitmap, c);
                                }
                            }
                        }
@@ -10081,6 +10230,16 @@ parseit:
        } /* End of processing all the folds */
     }
 
+    /* Combine the two lists into one. */
+    if (l1_fold_invlist) {
+       if (nonbitmap) {
+           nonbitmap = invlist_union(nonbitmap, l1_fold_invlist);
+       }
+       else {
+           nonbitmap = l1_fold_invlist;
+       }
+    }
+
     /* Here, we have calculated what code points should be in the character
      * class.   Now we can see about various optimizations.  Fold calculation
      * needs to take place before inversion.  Otherwise /[^k]/i would invert to
@@ -10109,7 +10268,7 @@ parseit:
     /* Folding in the bitmap is taken care of above, but not for locale (for
      * which we have to wait to see what folding is in effect at runtime), and
      * for things not in the bitmap.  Set run-time fold flag for these */
-    if (FOLD && (LOC || nonbitmap)) {
+    if (FOLD && (LOC || nonbitmap || unicode_alternate)) {
        ANYOF_FLAGS(ret) |= ANYOF_LOC_NONBITMAP_FOLD;
     }
 
@@ -10128,6 +10287,7 @@ parseit:
      * wouldn't work because they are part of the fold of 'LATIN SMALL LIGATURE
      * FI'. */
     if (! nonbitmap
+       && ! unicode_alternate
        && SvCUR(listsv) == initial_listsv_len
        && ! (ANYOF_FLAGS(ret) & (ANYOF_INVERT|ANYOF_UNICODE_ALL))
         && (((stored == 1 && ((! (ANYOF_FLAGS(ret) & ANYOF_LOCALE))
@@ -10255,6 +10415,9 @@ parseit:
        av_store(av, 0, listsv);
        av_store(av, 1, NULL);
        av_store(av, 2, MUTABLE_SV(unicode_alternate));
+       if (unicode_alternate) { /* This node is variable length */
+           OP(ret) = ANYOFV;
+       }
        rv = newRV_noinc(MUTABLE_SV(av));
        n = add_data(pRExC_state, 1, "s");
        RExC_rxi->data->data[n] = (void*)rv;
diff --git a/utf8.h b/utf8.h
index 7228000..b872859 100644
--- a/utf8.h
+++ b/utf8.h
@@ -424,6 +424,8 @@ Perl's extended UTF-8 means we can have start bytes up to 
FF.
 #define UNICODE_GREEK_SMALL_LETTER_FINAL_SIGMA 0x03C2
 #define UNICODE_GREEK_SMALL_LETTER_SIGMA       0x03C3
 #define GREEK_SMALL_LETTER_MU                   0x03BC
+#define GREEK_CAPITAL_LETTER_MU 0x039C /* Upper and title case of MICRON */
+#define LATIN_CAPITAL_LETTER_Y_WITH_DIAERESIS 0x0178   /* Also is title case */
 
 #define UNI_DISPLAY_ISPRINT    0x0001
 #define UNI_DISPLAY_BACKSLASH  0x0002
@@ -434,6 +436,8 @@ Perl's extended UTF-8 means we can have start bytes up to 
FF.
 #   define LATIN_SMALL_LETTER_SHARP_S  0x00DF
 #   define LATIN_SMALL_LETTER_Y_WITH_DIAERESIS 0x00FF
 #   define MICRO_SIGN 0x00B5
+#   define LATIN_CAPITAL_LETTER_A_WITH_RING_ABOVE 0x00C5
+#   define LATIN_SMALL_LETTER_A_WITH_RING_ABOVE 0x00E5
 #endif
 
 #define ANYOF_FOLD_SHARP_S(node, input, end)   \
diff --git a/utfebcdic.h b/utfebcdic.h
index a9d5f50..a9197a9 100644
--- a/utfebcdic.h
+++ b/utfebcdic.h
@@ -298,6 +298,8 @@ EXTCONST unsigned char PL_a2e[] = { /* ASCII (iso-8859-1) 
to EBCDIC (IBM-1047) *
 #define LATIN_SMALL_LETTER_Y_WITH_DIAERESIS 0xDF
 #define LATIN_SMALL_LETTER_SHARP_S 0x59
 #define MICRO_SIGN 0xA0
+#define LATIN_CAPITAL_LETTER_A_WITH_RING_ABOVE 0x0067
+#define LATIN_SMALL_LETTER_A_WITH_RING_ABOVE 0x0047
 
 EXTCONST unsigned char PL_e2a[] = { /* EBCDIC (IBM-1047) to ASCII (iso-8859-1) 
*/
  0x00, 0x01, 0x02, 0x03, 0x9C, 0x09, 0x86, 0x7F, 0x97, 0x8D, 0x8E, 0x0B, 0x0C, 
0x0D, 0x0E, 0x0F,
@@ -378,6 +380,8 @@ EXTCONST unsigned char PL_a2e[] = { /* ASCII (ISO8859-1) to 
EBCDIC (POSIX-BC) */
 #define LATIN_SMALL_LETTER_Y_WITH_DIAERESIS 0xDF
 #define LATIN_SMALL_LETTER_SHARP_S 0x59
 #define MICRO_SIGN 0xA0
+#define LATIN_CAPITAL_LETTER_A_WITH_RING_ABOVE 0x0067
+#define LATIN_SMALL_LETTER_A_WITH_RING_ABOVE 0x0047
 
 EXTCONST unsigned char PL_e2a[] = { /* EBCDIC (POSIX-BC) to ASCII (ISO8859-1) 
*/
  0x00, 0x01, 0x02, 0x03, 0x9C, 0x09, 0x86, 0x7F, 0x97, 0x8D, 0x8E, 0x0B, 0x0C, 
0x0D, 0x0E, 0x0F,
@@ -459,6 +463,8 @@ EXTCONST unsigned char PL_a2e[] = { /* ASCII (ISO8859-1) to 
EBCDIC (IBM-037) */
 #define LATIN_SMALL_LETTER_Y_WITH_DIAERESIS 0xDF
 #define LATIN_SMALL_LETTER_SHARP_S 0x59
 #define MICRO_SIGN 0xA0
+#define LATIN_CAPITAL_LETTER_A_WITH_RING_ABOVE 0x0067
+#define LATIN_SMALL_LETTER_A_WITH_RING_ABOVE 0x0047
 
 EXTCONST unsigned char PL_e2a[] = { /* EBCDIC (IBM-037) to ASCII (ISO8859-1) */
  0x00, 0x01, 0x02, 0x03, 0x9C, 0x09, 0x86, 0x7F, 0x97, 0x8D, 0x8E, 0x0B, 0x0C, 
0x0D, 0x0E, 0x0F,

--
Perl5 Master Repository

[perl.git] branch blead, updated. v5.13.10-129-gd50a4f9

Reply via email to