In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/36b347b9c8caf3680cadf3c3cb98018a41a92507?hp=c23a69e83f410b43450b9f1e9696d04795aae758>
- Log ----------------------------------------------------------------- commit 36b347b9c8caf3680cadf3c3cb98018a41a92507 Author: Karl Williamson <[email protected]> Date: Tue Jul 30 11:27:20 2013 -0600 regcomp.c: Remove extraneous debug info Prior to this commit the prhase {unicode} was emitted to mark what a bracketed character class matched that wasn't in that classes bitmap. This was oftern accompanied by another phrase that gave further details. Since everything is {unicode}, the first phrase isn't very helpful. Now it is changed to {utf8} for those things that won't match unless the target string is in utf8 (this includes some upper latin1 code points under /d matches), or {outside bitmap} for where utf8 isn't necessasily required (this happens for user-defined Unicode properties that aren't known at compile time). M regcomp.c commit b60fb2ed2680b5d8e070d120472a5888810bbd95 Author: Karl Williamson <[email protected]> Date: Tue Jul 30 11:21:11 2013 -0600 regcomp.c: White-space only Outdent code which the previous commit removed from a block. M regcomp.c commit 89d3fa0ee43d5c7489581a62b3d662c316bfcb43 Author: Karl Williamson <[email protected]> Date: Tue Jul 30 11:16:56 2013 -0600 regcomp.c: Remove redundant code This code is redundant, attempting to output what isn't returned in 'lv' by reglcass_swash(), but that function makes sure that 'lv' contains everything it should, so we ended up processing (and outputting) the same data twice. M regcomp.c commit a498e2f8b43c486d265d1afa2d1ca923f13325c2 Author: Karl Williamson <[email protected]> Date: Tue Jul 30 10:51:35 2013 -0600 regexec.c: Add, clarify comments M regexec.c commit 186551534c7256e97240abd050bead0f17f28fdc Author: Karl Williamson <[email protected]> Date: Sat Jul 27 19:10:27 2013 -0600 regcomp.c: Change Debug output of char classes This commit causes the debug output that was formerly "\x4ff", for example to be \x{4f}f. It always puts braces around the hex to separate it from other characters. M regcomp.c commit f202c207b253f1f19a37243471f35742d6bb309c Author: Karl Williamson <[email protected]> Date: Sat Jul 27 18:45:18 2013 -0600 regcomp.c: Debug output clearer ranges It's not immediately obvious what the character class [!-~] matches. Better is its equivalent: [\x21-\x7e]. This commit changes the debug output to be the latter for character class matches, while retaining the current behavior where it is clear what the range matches, in, e.g., [J-R]. Ranges like [A-z] include more than just alphabetics, so they are now output as [\x41-\x7a]. (Debug output is done, for example, when the command line option -Dr is specified.) M regcomp.c commit 9a1ec8a9cfbf5741c8f41cdf8d1f5fe0c3600696 Author: Karl Williamson <[email protected]> Date: Sat Jul 27 18:21:40 2013 -0600 regcomp.c: White-space only This indents properly to correspond to a newly formed block M regcomp.c commit e89035d52414c485845a8b299343063fa14a2253 Author: Karl Williamson <[email protected]> Date: Sat Jul 27 18:19:29 2013 -0600 regcomp.c: Change debug output to use \t, etc instead of hex It is easier to read the standard abbreviations \t, \n, etc than the hex equivalents, \x09, ... M regcomp.c commit 3ae1b3845fd924a5615289bd6a44ed109508f93f Author: Karl Williamson <[email protected]> Date: Sat Jul 27 18:14:12 2013 -0600 regcomp.c: Extract duplicated code into single fcn This code that appears twice is nearly duplicate. M embed.fnc M embed.h M proto.h M regcomp.c commit 5b34bd203cb394f6ec83a0cb41cf11c7b3937f77 Author: Karl Williamson <[email protected]> Date: Tue Jul 30 11:39:02 2013 -0600 Regen t/porting/known_pod_issues.dat. This silences a warning that shows up under pedantic mode M t/porting/known_pod_issues.dat commit 1fa1147c2c277eb9a22c643f21b08da9bc15510e Author: Karl Williamson <[email protected]> Date: Fri Jul 26 14:26:27 2013 -0600 regcomp.c: Fix potential scalar leak The lines in this code were reversed. We need to check something before overwriting it, rather than the other way around. The result would be that under certain circumstances a SV would not get freed. Those circumstances are very limited: the first of the three parameters to this function is not empty, but the 2nd is, and the output (3rd parameter) is to overwrite the 2nd. I found this bug by code reading; I have searched the code space and there are no current calls to it that have this parameter configuration, therefore there is no test that can be added to trigger it. M regcomp.c commit 15c0b4aa89b8fe74b8f5e4712967bbb49292b95e Author: Karl Williamson <[email protected]> Date: Wed Jul 24 20:02:40 2013 -0600 regcomp.c: Change #ifdef This function is currently #ifdef'd out. Change it so that enabling it in embed.fnc automatically enables it here as well, making a 2 step process into just a single step. M regcomp.c ----------------------------------------------------------------------- Summary of changes: embed.fnc | 1 + embed.h | 1 + proto.h | 6 + regcomp.c | 242 +++++++++++++++++++++-------------------- regexec.c | 21 ++-- t/porting/known_pod_issues.dat | 3 +- 6 files changed, 145 insertions(+), 129 deletions(-) diff --git a/embed.fnc b/embed.fnc index a85b8a6..e4cb24d 100644 --- a/embed.fnc +++ b/embed.fnc @@ -2063,6 +2063,7 @@ Es |const regnode*|dumpuntil|NN const regexp *r|NN const regnode *start \ |NULLOK const regnode *plast \ |NN SV* sv|I32 indent|U32 depth Es |void |put_byte |NN SV* sv|int c +Es |bool |put_latin1_charclass_innards|NN SV* sv|NN char* bitmap Es |void |dump_trie |NN const struct _reg_trie_data *trie\ |NULLOK HV* widecharmap|NN AV *revcharmap\ |U32 depth diff --git a/embed.h b/embed.h index d755269..94f4c15 100644 --- a/embed.h +++ b/embed.h @@ -882,6 +882,7 @@ #define dump_trie_interim_table(a,b,c,d,e) S_dump_trie_interim_table(aTHX_ a,b,c,d,e) #define dumpuntil(a,b,c,d,e,f,g,h) S_dumpuntil(aTHX_ a,b,c,d,e,f,g,h) #define put_byte(a,b) S_put_byte(aTHX_ a,b) +#define put_latin1_charclass_innards(a,b) S_put_latin1_charclass_innards(aTHX_ a,b) #define regdump_extflags(a,b) S_regdump_extflags(aTHX_ a,b) #define regdump_intflags(a,b) S_regdump_intflags(aTHX_ a,b) #define regtail_study(a,b,c,d) S_regtail_study(aTHX_ a,b,c,d) diff --git a/proto.h b/proto.h index 15ec073..8599884 100644 --- a/proto.h +++ b/proto.h @@ -5217,6 +5217,12 @@ STATIC void S_put_byte(pTHX_ SV* sv, int c) #define PERL_ARGS_ASSERT_PUT_BYTE \ assert(sv) +STATIC bool S_put_latin1_charclass_innards(pTHX_ SV* sv, char* bitmap) + __attribute__nonnull__(pTHX_1) + __attribute__nonnull__(pTHX_2); +#define PERL_ARGS_ASSERT_PUT_LATIN1_CHARCLASS_INNARDS \ + assert(sv); assert(bitmap) + STATIC void S_regdump_extflags(pTHX_ const char *lead, const U32 flags); STATIC void S_regdump_intflags(pTHX_ const char *lead, const U32 flags); STATIC U8 S_regtail_study(pTHX_ struct RExC_state_t *pRExC_state, regnode *p, const regnode *val, U32 depth) diff --git a/regcomp.c b/regcomp.c index 0af3483..8c7a6f8 100644 --- a/regcomp.c +++ b/regcomp.c @@ -7844,11 +7844,11 @@ Perl__invlist_intersection_maybe_complement_2nd(pTHX_ SV* const a, SV* const b, * must be every possible code point. Thus the intersection is * simply 'a'. */ if (*i != a) { - *i = invlist_clone(a); - if (*i == b) { SvREFCNT_dec_NN(b); } + + *i = invlist_clone(a); } /* else *i is already 'a' */ return; @@ -8312,7 +8312,7 @@ Perl__invlist_dump(pTHX_ SV* const invlist, const char * const header) } #endif -#if 0 +#ifdef PERL_ARGS_ASSERT__INVLISTEQ bool S__invlistEQ(pTHX_ SV* const a, SV* const b, const bool complement_b) { @@ -14685,26 +14685,10 @@ Perl_regprop(pTHX_ const regexp *prog, SV *sv, const regnode *o) ) ); if ( IS_ANYOF_TRIE(op) || trie->bitmap ) { - int i; - int rangestart = -1; - U8* bitmap = IS_ANYOF_TRIE(op) ? (U8*)ANYOF_BITMAP(o) : (U8*)TRIE_BITMAP(trie); sv_catpvs(sv, "["); - for (i = 0; i <= 256; i++) { - if (i < 256 && BITMAP_TEST(bitmap,i)) { - if (rangestart == -1) - rangestart = i; - } else if (rangestart != -1) { - if (i <= rangestart + 3) - for (; rangestart < i; rangestart++) - put_byte(sv, rangestart); - else { - put_byte(sv, rangestart); - sv_catpvs(sv, "-"); - put_byte(sv, i - 1); - } - rangestart = -1; - } - } + (void) put_latin1_charclass_innards(sv, IS_ANYOF_TRIE(op) + ? ANYOF_BITMAP(o) + : TRIE_BITMAP(trie)); sv_catpvs(sv, "]"); } @@ -14748,7 +14732,6 @@ Perl_regprop(pTHX_ const regexp *prog, SV *sv, const regnode *o) } else if (k == LOGICAL) Perl_sv_catpvf(aTHX_ sv, "[%d]", o->flags); /* 2: embedded, otherwise 1 */ else if (k == ANYOF) { - int i, rangestart = -1; const U8 flags = ANYOF_FLAGS(o); int do_sep = 0; @@ -14762,32 +14745,19 @@ Perl_regprop(pTHX_ const regexp *prog, SV *sv, const regnode *o) sv_catpvs(sv, "^"); /* output what the standard cp 0-255 bitmap matches */ - for (i = 0; i <= 256; i++) { - if (i < 256 && ANYOF_BITMAP_TEST(o,i)) { - if (rangestart == -1) - rangestart = i; - } else if (rangestart != -1) { - if (i <= rangestart + 3) - for (; rangestart < i; rangestart++) - put_byte(sv, rangestart); - else { - put_byte(sv, rangestart); - sv_catpvs(sv, "-"); - put_byte(sv, i - 1); - } - do_sep = 1; - rangestart = -1; - } - } + do_sep = put_latin1_charclass_innards(sv, ANYOF_BITMAP(o)); EMIT_ANYOF_TEST_SEPARATOR(do_sep,sv,flags); /* output any special charclass tests (used entirely under use locale) */ - if (ANYOF_CLASS_TEST_ANY_SET(o)) - for (i = 0; i < (int)(sizeof(anyofs)/sizeof(char*)); i++) + if (ANYOF_CLASS_TEST_ANY_SET(o)) { + int i; + for (i = 0; i < (int)(sizeof(anyofs)/sizeof(char*)); i++) { if (ANYOF_CLASS_TEST(o,i)) { sv_catpv(sv, anyofs[i]); do_sep = 1; } + } + } EMIT_ANYOF_TEST_SEPARATOR(do_sep,sv,flags); @@ -14798,91 +14768,61 @@ Perl_regprop(pTHX_ const regexp *prog, SV *sv, const regnode *o) /* output information about the unicode matching */ if (flags & ANYOF_UNICODE_ALL) sv_catpvs(sv, "{unicode_all}"); - else if (ANYOF_NONBITMAP(o)) - sv_catpvs(sv, "{unicode}"); - if (flags & ANYOF_NONBITMAP_NON_UTF8) - sv_catpvs(sv, "{outside bitmap}"); + else if (ANYOF_NONBITMAP(o)) { + SV *lv; /* Set if there is something outside the bit map. */ + + if (flags & ANYOF_NONBITMAP_NON_UTF8) { + sv_catpvs(sv, "{outside bitmap}"); + } + else { + sv_catpvs(sv, "{utf8}"); + } - if (ANYOF_NONBITMAP(o)) { - SV *lv; /* Set if there is something outside the bit map */ + /* Get the stuff that wasn't in the bitmap */ SV * const sw = regclass_swash(prog, o, FALSE, &lv, NULL); bool byte_output = FALSE; /* If something in the bitmap has been output */ - if (lv && lv != &PL_sv_undef) { - if (sw) { - U8 s[UTF8_MAXBYTES_CASE+1]; - - for (i = 0; i <= 256; i++) { /* Look at chars in bitmap */ - uvchr_to_utf8(s, i); - - if (i < 256 - && ! ANYOF_BITMAP_TEST(o, i) /* Don't duplicate - things already - output as part - of the bitmap */ - && swash_fetch(sw, s, TRUE)) - { - if (rangestart == -1) - rangestart = i; - } else if (rangestart != -1) { - byte_output = TRUE; - if (i <= rangestart + 3) - for (; rangestart < i; rangestart++) { - put_byte(sv, rangestart); - } - else { - put_byte(sv, rangestart); - sv_catpvs(sv, "-"); - put_byte(sv, i-1); - } - rangestart = -1; - } - } - } + char *s = savesvpv(lv); + char * const origs = s; - { - char *s = savesvpv(lv); - char * const origs = s; - - while (*s && *s != '\n') - s++; + while (*s && *s != '\n') + s++; - if (*s == '\n') { - const char * const t = ++s; + if (*s == '\n') { + const char * const t = ++s; - if (byte_output) { - sv_catpvs(sv, " "); - } + if (byte_output) { + sv_catpvs(sv, " "); + } - while (*s) { - if (*s == '\n') { + while (*s) { + if (*s == '\n') { - /* Truncate very long output */ - if (s - origs > 256) { - Perl_sv_catpvf(aTHX_ sv, - "%.*s...", - (int) (s - origs - 1), - t); - goto out_dump; - } - *s = ' '; - } - else if (*s == '\t') { - *s = '-'; - } - s++; - } - if (s[-1] == ' ') - s[-1] = 0; + /* Truncate very long output */ + if (s - origs > 256) { + Perl_sv_catpvf(aTHX_ sv, + "%.*s...", + (int) (s - origs - 1), + t); + goto out_dump; + } + *s = ' '; + } + else if (*s == '\t') { + *s = '-'; + } + s++; + } + if (s[-1] == ' ') + s[-1] = 0; - sv_catpv(sv, t); - } + sv_catpv(sv, t); + } - out_dump: + out_dump: - Safefree(origs); - } + Safefree(origs); SvREFCNT_dec_NN(lv); } } @@ -15511,12 +15451,17 @@ S_put_byte(pTHX_ SV *sv, int c) So the old condition can be simplified to !isPRINT(c) */ if (!isPRINT(c)) { - if (c < 256) { - Perl_sv_catpvf(aTHX_ sv, "\\x%02x", c); - } - else { - Perl_sv_catpvf(aTHX_ sv, "\\x{%x}", c); - } + switch (c) { + case '\r': Perl_sv_catpvf(aTHX_ sv, "\\r"); break; + case '\n': Perl_sv_catpvf(aTHX_ sv, "\\n"); break; + case '\t': Perl_sv_catpvf(aTHX_ sv, "\\t"); break; + case '\f': Perl_sv_catpvf(aTHX_ sv, "\\f"); break; + case '\a': Perl_sv_catpvf(aTHX_ sv, "\\a"); break; + + default: + Perl_sv_catpvf(aTHX_ sv, "\\x{%x}", c); + break; + } } else { const char string = c; @@ -15526,6 +15471,63 @@ S_put_byte(pTHX_ SV *sv, int c) } } +STATIC bool +S_put_latin1_charclass_innards(pTHX_ SV *sv, char *bitmap) +{ + /* Appends to 'sv' a displayable version of the innards of the bracketed + * character class whose bitmap is 'bitmap'; Returns 'TRUE' if it actually + * output anything */ + + int i; + int rangestart = -1; + bool has_output_anything = FALSE; + + PERL_ARGS_ASSERT_PUT_LATIN1_CHARCLASS_INNARDS; + + for (i = 0; i <= 256; i++) { + if (i < 256 && BITMAP_TEST((U8 *) bitmap,i)) { + if (rangestart == -1) + rangestart = i; + } else if (rangestart != -1) { + int j = i - 1; + if (i <= rangestart + 3) { /* Individual chars in short ranges */ + for (; rangestart < i; rangestart++) + put_byte(sv, rangestart); + } + else if ( j > 255 + || ! isALPHANUMERIC(rangestart) + || ! isALPHANUMERIC(j) + || isDIGIT(rangestart) != isDIGIT(j) + || isUPPER(rangestart) != isUPPER(j) + || isLOWER(rangestart) != isLOWER(j) + + /* This final test should get optimized out except + * on EBCDIC platforms, where it causes ranges that + * cross discontinuities like i/j to be shown as hex + * instead of the misleading, e.g. H-K (since that + * range includes more than H, I, J, K). */ + || (j - rangestart) + != NATIVE_TO_ASCII(j) - NATIVE_TO_ASCII(rangestart)) + { + Perl_sv_catpvf(aTHX_ sv, "\\x{%02x}-\\x{%02x}", + rangestart, + (j < 256) ? j : 255); + } + else { /* Here, the ends of the range are both digits, or both + uppercase, or both lowercase; and there's no + discontinuity in the range (which could happen on EBCDIC + platforms) */ + put_byte(sv, rangestart); + sv_catpvs(sv, "-"); + put_byte(sv, j); + } + rangestart = -1; + has_output_anything = TRUE; + } + } + + return has_output_anything; +} #define CLEAR_OPTSTART \ if (optstart) STMT_START { \ diff --git a/regexec.c b/regexec.c index 04491ee..4a350fb 100644 --- a/regexec.c +++ b/regexec.c @@ -7271,15 +7271,18 @@ STATIC SV * S_core_regclass_swash(pTHX_ const regexp *prog, const regnode* node, bool doinit, SV** listsvp) { /* Returns the swash for the input 'node' in the regex 'prog'. - * If <doinit> is true, will attempt to create the swash if not already + * If <doinit> is 'true', will attempt to create the swash if not already * done. - * If <listsvp> is non-null, will return the swash initialization string in - * it. + * If <listsvp> is non-null, will return the printable contents of the + * swash. This can be used to get debugging information even before the + * swash exists, by calling this function with 'doinit' set to false, in + * which case the components that will be used to eventually create the + * swash are returned (in a printable form). * Tied intimately to how regcomp.c sets up the data structure */ dVAR; SV *sw = NULL; - SV *si = NULL; + SV *si = NULL; /* Input swash initialization string */ SV* invlist = NULL; RXi_GET_DECL(prog,progi); @@ -7332,16 +7335,18 @@ S_core_regclass_swash(pTHX_ const regexp *prog, const regnode* node, bool doinit } } + /* If requested, return a printable version of what this swash matches */ if (listsvp) { SV* matches_string = newSVpvn("", 0); - /* Use the swash, if any, which has to have incorporated into it all - * possibilities */ + /* The swash should be used, if possible, to get the data, as it + * contains the resolved data. But this function can be called at + * compile-time, before everything gets resolved, in which case we + * return the currently best available information, which is the string + * that will eventually be used to do that resolving, 'si' */ if ((! sw || (invlist = _get_swash_invlist(sw)) == NULL) && (si && si != &PL_sv_undef)) { - - /* If no swash, use the input initialization string, if available */ sv_catsv(matches_string, si); } diff --git a/t/porting/known_pod_issues.dat b/t/porting/known_pod_issues.dat index 0bd64c5..12c2b15 100644 --- a/t/porting/known_pod_issues.dat +++ b/t/porting/known_pod_issues.dat @@ -35,8 +35,8 @@ Class::PseudoHash Classic::Perl Clone cpan2dist(1) -cpanp(1) CPAN::Changes::Spec +cpanp(1) CPANPLUS Crypt::Random curl(1) @@ -230,6 +230,7 @@ pod/perldsc.pod Verbatim line length including indents exceeds 79 by 4 pod/perldtrace.pod Verbatim line length including indents exceeds 79 by 26 pod/perlebcdic.pod Verbatim line length including indents exceeds 79 by 13 pod/perlembed.pod Verbatim line length including indents exceeds 79 by 27 +pod/perlfunc.pod ? Should you be using F<...> or maybe L<...> instead of 1 pod/perlgit.pod Verbatim line length including indents exceeds 79 by 12 pod/perlgpl.pod Verbatim line length including indents exceeds 79 by 50 pod/perlguts.pod ? Should you be using F<...> or maybe L<...> instead of 2 -- Perl5 Master Repository
