In perl.git, the branch blead has been updated <https://perl5.git.perl.org/perl.git/commitdiff/2a614cdcffdf336bc49e524a4ac3af94df7d4d00?hp=a1f354d3762aa87a796bc3d383629bfc853472f5>
- Log ----------------------------------------------------------------- commit 2a614cdcffdf336bc49e524a4ac3af94df7d4d00 Author: Karl Williamson <[email protected]> Date: Fri Feb 1 08:48:20 2019 -0700 regen/unicode_constants.pl: generate UTF-8 for U+307 This will be needed in a future commit commit e0bfe19f1cff16db3441822a6812a07ca124c861 Author: Karl Williamson <[email protected]> Date: Fri Feb 1 08:29:51 2019 -0700 t/loc_tools.pl: Add fcn to return all UTF-8 locales This will be needed in future commits commit 79ba27676437312e9dd6ce7ea8a47676cb57e5fc Author: Karl Williamson <[email protected]> Date: Fri Feb 1 11:45:34 2019 -0700 pp.c: White-space only Indent block newly formed in the previous commit commit dbb3849a8c02c652b48b25b770fc36b743b162db Author: Karl Williamson <[email protected]> Date: Fri Feb 1 11:43:10 2019 -0700 pp.c: Avoid use of unsafe function The function is unsafe because it doesn't check for running off the end of the buffer if presented with illegal UTF-8. The only remaining use now is from mathoms.c. commit 02601e33951e916a19e46272146a0b59862aaff5 Author: Karl Williamson <[email protected]> Date: Fri Feb 1 11:41:14 2019 -0700 pp.c: Add branch prediction hint This conditional is very rarely true commit 2f8f985a27faf25c5a535cbe67d098690668c0f9 Author: Karl Williamson <[email protected]> Date: Wed Jan 30 11:24:12 2019 -0700 pp.c: Don't assume worst case memory needs Since 5.28, there has been a function that will calculate the expansion of a string when converted into UTF-8, using per-word operations. This means it runs 8 times faster than doing this count previously would have taken. I've come to believe it is better to calculate how much memory we need than to overallocate based on worst-case scenarios. This is because in very large strings, over allocating can lead to unnecessary inefficient processing. This commit changes several instances in pp.c where a string needs to be converted to UTF-8 to not assume the worst case, but instead calculate what's needed using the faster function. commit 78ed8e3629d58d11345e4367dbe14b9603e8c84b Author: Karl Williamson <[email protected]> Date: Wed Jan 30 11:09:01 2019 -0700 pp.c: Don't use function call for easy copy Like the previous commit, this code is adding the UTF-8 for a Greek character to a string. It previously used Copy, but this character is representable as two bytes in both ASCII and EBCDIC UTF-8, the only character sets that Perl will ever supports, so we can use the specialized code that is used most everywhere else for two byte UTF-8 characters, avoiding the function overhead, and having to treat this character as particularly special. commit 93327b758a54c8e1ff7ee137a513caff4d077a7d Author: Karl Williamson <[email protected]> Date: Wed Jan 30 10:52:41 2019 -0700 pp.c: Don't use function call for easy copy This code is adding the UTF-8 for a Greek character to a string. It previously used Copy, but this character is representable as two bytes in both ASCII and EBCDIC UTF-8, the only character sets that Perl will ever supports, so we can use the specialized code that is used most everywhere else for two byte UTF-8 characters, avoiding the function overhead, and having to treat this character as particularly special. commit 526f8cbff8ce0a6402d8eb64ac3970e48c8716c3 Author: Karl Williamson <[email protected]> Date: Wed Jan 30 10:35:21 2019 -0700 pp.c: pp_fc(): Simplify The function being called does everything that the code being eliminated here did. We just pass the function the final destination instead of a temporary. commit a8e41ef404b996cb8f50be6cce716145ac4a3f67 Author: Karl Williamson <[email protected]> Date: Wed Jan 30 10:27:17 2019 -0700 pp.c: White-space, comments only commit ca62a7c2ce92965c24def9ea277e9ad42ea797d1 Author: Karl Williamson <[email protected]> Date: Wed Jan 30 10:02:35 2019 -0700 pp.c: Reorder clause order in an 'if' This makes the test most likely to fail be first, and adding an UNLIKELY() to it, thus saving a conditional in most instances. commit df7d4938c6907db4b8030fd133ca9d55e1e44a0d Author: Karl Williamson <[email protected]> Date: Tue Jan 29 22:02:59 2019 -0700 pp.c: Use faster method to convert to UTF-8 There is a special inline function that's used when converting a single byte to UTF-8, that is faster than the more general one used prior to this commit. commit f4cd1cd9e8d271b135a75b4b6fd817fa758c112a Author: Karl Williamson <[email protected]> Date: Tue Jan 29 22:01:18 2019 -0700 pp.c: Add missing assert The comments say there is an assert, but it wasn't there. commit 1c4079115ad9f58e29e98bd09de8772737e77be5 Author: Karl Williamson <[email protected]> Date: Mon Feb 4 16:02:35 2019 -0700 t/op/lc.t: Add 'use strict' commit 5583386ecf7417b7a05ab2f75b7284e6c90079fa Author: Karl Williamson <[email protected]> Date: Tue Jan 29 22:25:03 2019 -0700 t/re/fold_grind.pl: White-space only Just align some logical or clauses for readability. commit 247985d477048e4fea858e98efd13e728744b370 Author: Karl Williamson <[email protected]> Date: Wed Jan 30 09:08:13 2019 -0700 handy.h: Add comment commit 5a10328cd52e3a7a3fa9244dbc367ee439850cab Author: Karl Williamson <[email protected]> Date: Fri Jan 25 09:55:58 2019 -0700 handy.h: White-space only Vertically align the ternary colon with the question mark above it. commit 9d3980bc229750e6c07726fe529f02bf4dc6a5a5 Author: Karl Williamson <[email protected]> Date: Wed Jan 23 15:42:35 2019 -0700 handy.h: Add void * casts to memEQ, memNE This change is to allow these macros to be called without having to do casting in the call. commit 813cfad2cc5a494533659beaa4833ff222b4e131 Author: Karl Williamson <[email protected]> Date: Wed Jan 30 15:00:30 2019 -0700 regcomp.c: Fix recent optimization of [...] bug This bug was introduced in b2296192536090829ba6d2cb367456f4e346dcc6 n 5.29.7. Using /il should not result in looking for a [:posix:] class that matches the code points given. ----------------------------------------------------------------------- Summary of changes: embed.fnc | 2 +- embed.h | 2 +- handy.h | 10 +- invlist_inline.h | 2 +- pp.c | 239 +++++++++++++++++++++++++-------------------- proto.h | 2 +- regcomp.c | 2 +- regen/unicode_constants.pl | 4 +- t/loc_tools.pl | 24 ++++- t/op/lc.t | 7 +- t/re/anyof.t | 1 + t/re/fold_grind.pl | 6 +- unicode_constants.h | 12 +-- 13 files changed, 178 insertions(+), 135 deletions(-) diff --git a/embed.fnc b/embed.fnc index d311ca7f51..c7816d531c 100644 --- a/embed.fnc +++ b/embed.fnc @@ -1748,7 +1748,7 @@ EXp |SV* |_core_swash_init|NN const char* pkg|NN const char* name \ |NN SV* listsv|I32 minbits|I32 none \ |NULLOK SV* invlist|NULLOK U8* const flags_p #endif -#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C) +#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_PP_C) EiMRn |UV* |invlist_array |NN SV* const invlist EiMRn |bool |is_invlist |NN SV* const invlist EiMRn |bool* |get_invlist_offset_addr|NN SV* invlist diff --git a/embed.h b/embed.h index f3b95eadbd..149f1bee25 100644 --- a/embed.h +++ b/embed.h @@ -1249,7 +1249,7 @@ #endif #define regprop(a,b,c,d,e) Perl_regprop(aTHX_ a,b,c,d,e) # endif -# if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C) +# if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_PP_C) #define _get_swash_invlist(a) Perl__get_swash_invlist(aTHX_ a) #define _invlist_contains_cp S__invlist_contains_cp #define _invlist_len S__invlist_len diff --git a/handy.h b/handy.h index d2a7801a25..954b9caa30 100644 --- a/handy.h +++ b/handy.h @@ -507,8 +507,8 @@ based on the underlying C library functions): #define strnNE(s1,s2,l) (strncmp(s1,s2,l) != 0) #define strnEQ(s1,s2,l) (strncmp(s1,s2,l) == 0) -#define memNE(s1,s2,l) (memcmp(s1,s2,l) != 0) -#define memEQ(s1,s2,l) (memcmp(s1,s2,l) == 0) +#define memEQ(s1,s2,l) (memcmp(((const void *) (s1)), ((const void *) (s2)), l) == 0) +#define memNE(s1,s2,l) (! memEQ(s1,s2,l)) /* memEQ and memNE where second comparand is a string constant */ #define memEQs(s1, l, s2) \ @@ -1540,12 +1540,14 @@ END_EXTERN_C || (char)(c) == '_')) /* These next three are also for internal core Perl use only: case-change - * helper macros */ + * helper macros. The reason for using the PL_latin arrays is in case the + * system function is defective; it ensures uniform results that conform to the + * Unicod standard. */ #define _generic_toLOWER_LC(c, function, cast) (! FITS_IN_8_BITS(c) \ ? (c) \ : (IN_UTF8_CTYPE_LOCALE) \ ? PL_latin1_lc[ (U8) (c) ] \ - : (cast)function((cast)(c))) + : (cast)function((cast)(c))) /* Note that the result can be larger than a byte in a UTF-8 locale. It * returns a single value, so can't adequately return the upper case of LATIN diff --git a/invlist_inline.h b/invlist_inline.h index cd002cef19..1304b4543a 100644 --- a/invlist_inline.h +++ b/invlist_inline.h @@ -9,7 +9,7 @@ #ifndef PERL_INVLIST_INLINE_H_ #define PERL_INVLIST_INLINE_H_ -#if defined(PERL_IN_UTF8_C) || defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) +#if defined(PERL_IN_UTF8_C) || defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_PP_C) /* An element is in an inversion list iff its index is even numbered: 0, 2, 4, * etc */ diff --git a/pp.c b/pp.c index 880f266081..522e985931 100644 --- a/pp.c +++ b/pp.c @@ -28,12 +28,10 @@ #include "perl.h" #include "keywords.h" +#include "invlist_inline.h" #include "reentr.h" #include "regcharclass.h" -static const STRLEN small_mu_len = sizeof(GREEK_SMALL_LETTER_MU_UTF8) - 1; -static const STRLEN capital_iota_len = sizeof(GREEK_CAPITAL_LETTER_IOTA_UTF8) - 1; - /* variations on pp_null */ PP(pp_stub) @@ -364,7 +362,7 @@ PP(pp_rv2cv) cv = SvTYPE(SvRV(gv)) == SVt_PVCV ? MUTABLE_CV(SvRV(gv)) : MUTABLE_CV(gv); - } + } else cv = MUTABLE_CV(&PL_sv_undef); SETs(MUTABLE_SV(cv)); @@ -670,7 +668,7 @@ PP(pp_study) PP(pp_trans) { - dSP; + dSP; SV *sv; if (PL_op->op_flags & OPf_STACKED) @@ -1161,18 +1159,18 @@ PP(pp_pow) else if (result <= (UV)IV_MAX) /* answer negative, fits in IV */ SETi( -(IV)result ); - else if (result == (UV)IV_MIN) + else if (result == (UV)IV_MIN) /* 2's complement assumption: special case IV_MIN */ SETi( IV_MIN ); else /* answer negative, doesn't fit */ SETn( -(NV)result ); RETURN; - } + } } } float_it: -#endif +#endif { NV right = SvNV_nomg(svr); NV left = SvNV_nomg(svl); @@ -1905,7 +1903,7 @@ PP(pp_subtract) UV result; UV buv; bool buvok = SvUOK(svr); - + if (buvok) buv = SvUVX(svr); else { @@ -2893,7 +2891,7 @@ PP(pp_rand) { dSP; NV value; - + if (MAXARG < 1) { EXTEND(SP, 1); @@ -3064,7 +3062,7 @@ PP(pp_oct) /* If Unicode, try to downgrade * If not possible, croak. */ SV* const tsv = sv_2mortal(newSVsv(sv)); - + SvUTF8_on(tsv); sv_utf8_downgrade(tsv, FALSE); tmps = SvPV_const(tsv, len); @@ -3539,7 +3537,7 @@ PP(pp_index) /* $lex = (index() == -1) */ sv_setsv(TARG, TOPs); } - else + else PUSHi(retval); RETURN; } @@ -3681,7 +3679,7 @@ PP(pp_crypt) #endif } -/* Generally UTF-8 and UTF-EBCDIC are indistinguishable at this level. So +/* Generally UTF-8 and UTF-EBCDIC are indistinguishable at this level. So * most comments below say UTF-8, when in fact they mean UTF-EBCDIC as well */ @@ -3747,12 +3745,15 @@ PP(pp_ucfirst) #endif } else { + #ifdef USE_LOCALE_CTYPE + _toLOWER_utf8_flags(s, s + slen, tmpbuf, &tculen, IN_LC_RUNTIME(LC_CTYPE)); #else _toLOWER_utf8_flags(s, s + slen, tmpbuf, &tculen, 0); #endif - } + + } /* we can't do in-place if the length changes. */ if (ulen != tculen) inplace = FALSE; @@ -3760,7 +3761,7 @@ PP(pp_ucfirst) } else { /* Non-zero length, non-UTF-8, Need to consider locale and if * latin1 is treated as caseless. Note that a locale takes - * precedence */ + * precedence */ ulen = 1; /* Original character is 1 byte */ tculen = 1; /* Most characters will require one byte, but this will * need to be overridden for the tricky ones */ @@ -3824,13 +3825,16 @@ PP(pp_ucfirst) inplace = FALSE; /* If the result won't fit in a byte, the entire result - * will have to be in UTF-8. Assume worst case sizing in - * conversion. (all latin1 characters occupy at most two - * bytes in utf8) */ + * will have to be in UTF-8. Allocate enough space for the + * expanded first byte, and if UTF-8, the rest of the input + * string, some or all of which may also expand to two + * bytes, plus the terminating NUL. */ if (title_ord > 255) { doing_utf8 = TRUE; convert_source_to_utf8 = TRUE; - need = slen * 2 + 1; + need = slen + + variant_under_utf8_count(s, s + slen) + + 1; /* The (converted) UTF-8 and UTF-EBCDIC lengths of all * (both) characters whose title case is above 255 is @@ -3890,13 +3894,16 @@ PP(pp_ucfirst) /* Assert tculen is 2 here because the only two characters that * get to this part of the code have 2-byte UTF-8 equivalents */ + assert(tculen == 2); *d++ = *tmpbuf; *d++ = *(tmpbuf + 1); s++; /* We have just processed the 1st char */ - for (; s < send; s++) { - d = uvchr_to_utf8(d, *s); - } + while (s < send) { + append_utf8_from_native_byte(*s, &d); + s++; + } + *d = '\0'; SvCUR_set(dest, d - (U8*)SvPVX_const(dest)); } @@ -3908,7 +3915,7 @@ PP(pp_ucfirst) } } - else { /* Neither source nor dest are in or need to be UTF-8 */ + else { /* Neither source nor dest are, nor need to be UTF-8 */ if (slen) { if (inplace) { /* in-place, only need to change the 1st char */ *d = *tmpbuf; @@ -3949,9 +3956,6 @@ PP(pp_ucfirst) return NORMAL; } -/* There's so much setup/teardown code common between uc and lc, I wonder if - it would be worth merging the two, and just having a switch outside each - of the three tight loops. There is less and less commonality though */ PP(pp_uc) { dSP; @@ -4018,6 +4022,8 @@ PP(pp_uc) const U8 *const send = s + len; U8 tmpbuf[UTF8_MAXBYTES_CASE+1]; +#define GREEK_CAPITAL_LETTER_IOTA 0x0399 +#define COMBINING_GREEK_YPOGEGRAMMENI 0x0345 /* All occurrences of these are to be moved to follow any other marks. * This is context-dependent. We may not be passed enough context to * move the iota subscript beyond all of them, but we do the best we can @@ -4034,12 +4040,16 @@ PP(pp_uc) STRLEN u; STRLEN ulen; UV uv; - if (in_iota_subscript && ! _is_utf8_mark(s)) { + if (UNLIKELY(in_iota_subscript)) { + UV cp = utf8_to_uvchr_buf(s, send, NULL); + + if (! _invlist_contains_cp(PL_utf8_mark, cp)) { - /* A non-mark. Time to output the iota subscript */ - Copy(GREEK_CAPITAL_LETTER_IOTA_UTF8, d, capital_iota_len, U8); - d += capital_iota_len; - in_iota_subscript = FALSE; + /* A non-mark. Time to output the iota subscript */ + *d++ = UTF8_TWO_BYTE_HI(GREEK_CAPITAL_LETTER_IOTA); + *d++ = UTF8_TWO_BYTE_LO(GREEK_CAPITAL_LETTER_IOTA); + in_iota_subscript = FALSE; + } } /* Then handle the current character. Get the changed case value @@ -4051,8 +4061,6 @@ PP(pp_uc) #else uv = _toUPPER_utf8_flags(s, send, tmpbuf, &ulen, 0); #endif -#define GREEK_CAPITAL_LETTER_IOTA 0x0399 -#define COMBINING_GREEK_YPOGEGRAMMENI 0x0345 if (uv == GREEK_CAPITAL_LETTER_IOTA && utf8_to_uvchr_buf(s, send, 0) == COMBINING_GREEK_YPOGEGRAMMENI) { @@ -4066,9 +4074,10 @@ PP(pp_uc) /* If someone uppercases one million U+03B0s we SvGROW() * one million times. Or we could try guessing how much to - * allocate without allocating too much. Such is life. - * See corresponding comment in lc code for another option - * */ + * allocate without allocating too much. But we can't + * really guess without examining the rest of the string. + * Such is life. See corresponding comment in lc code for + * another option */ d = o + (U8*) SvGROW(dest, min); } Copy(tmpbuf, d, ulen, U8); @@ -4077,8 +4086,8 @@ PP(pp_uc) s += u; } if (in_iota_subscript) { - Copy(GREEK_CAPITAL_LETTER_IOTA_UTF8, d, capital_iota_len, U8); - d += capital_iota_len; + *d++ = UTF8_TWO_BYTE_HI(GREEK_CAPITAL_LETTER_IOTA); + *d++ = UTF8_TWO_BYTE_LO(GREEK_CAPITAL_LETTER_IOTA); } SvUTF8_on(dest); *d = '\0'; @@ -4112,6 +4121,8 @@ PP(pp_uc) do_uni_rules: #endif for (; s < send; d++, s++) { + Size_t extra; + *d = toUPPER_LATIN1_MOD(*s); if (LIKELY(*d != LATIN_SMALL_LETTER_Y_WITH_DIAERESIS)) { continue; @@ -4130,7 +4141,7 @@ PP(pp_uc) /* uc() of this requires 2 characters, but they are * ASCII. If not enough room, grow the string */ - if (SvLEN(dest) < ++min) { + if (SvLEN(dest) < ++min) { const UV o = d - (U8*)SvPVX_const(dest); d = o + (U8*) SvGROW(dest, min); } @@ -4141,48 +4152,54 @@ PP(pp_uc) /* The other two special handling characters have their * upper cases outside the latin1 range, hence need to be - * in UTF-8, so the whole result needs to be in UTF-8. So, - * here we are somewhere in the middle of processing a - * non-UTF-8 string, and realize that we will have to convert - * the whole thing to UTF-8. What to do? There are - * several possibilities. The simplest to code is to - * convert what we have so far, set a flag, and continue on - * in the loop. The flag would be tested each time through - * the loop, and if set, the next character would be - * converted to UTF-8 and stored. But, I (khw) didn't want - * to slow down the mainstream case at all for this fairly - * rare case, so I didn't want to add a test that didn't - * absolutely have to be there in the loop, besides the - * possibility that it would get too complicated for - * optimizers to deal with. Another possibility is to just - * give up, convert the source to UTF-8, and restart the - * function that way. Another possibility is to convert - * both what has already been processed and what is yet to - * come separately to UTF-8, then jump into the loop that - * handles UTF-8. But the most efficient time-wise of the - * ones I could think of is what follows, and turned out to - * not require much extra code. */ - - /* Convert what we have so far into UTF-8, telling the + * in UTF-8, so the whole result needs to be in UTF-8. + * + * So, here we are somewhere in the middle of processing a + * non-UTF-8 string, and realize that we will have to + * convert the whole thing to UTF-8. What to do? There + * are several possibilities. The simplest to code is to + * convert what we have so far, set a flag, and continue on + * in the loop. The flag would be tested each time through + * the loop, and if set, the next character would be + * converted to UTF-8 and stored. But, I (khw) didn't want + * to slow down the mainstream case at all for this fairly + * rare case, so I didn't want to add a test that didn't + * absolutely have to be there in the loop, besides the + * possibility that it would get too complicated for + * optimizers to deal with. Another possibility is to just + * give up, convert the source to UTF-8, and restart the + * function that way. Another possibility is to convert + * both what has already been processed and what is yet to + * come separately to UTF-8, then jump into the loop that + * handles UTF-8. But the most efficient time-wise of the + * ones I could think of is what follows, and turned out to + * not require much extra code. + * + * First, calculate the extra space needed for the + * remainder of the source needing to be in UTF-8. The + * uppercase of a character below 256 occupies the same + * number of bytes as the original. Therefore, the space + * needed is the that number plus the number of characters + * that become two bytes when converted to UTF-8. */ + + extra = send - s + variant_under_utf8_count(s, send); + + /* Convert what we have so far into UTF-8, telling the * function that we know it should be converted, and to * allow extra space for what we haven't processed yet. - * Assume the worst case space requirements for converting - * what we haven't processed so far: that it will require - * two bytes for each remaining source character, plus the - * NUL at the end. This may cause the string pointer to - * move, so re-find it. */ + * + * This may cause the string pointer to move, so need to + * save and re-find it. */ len = d - (U8*)SvPVX_const(dest); SvCUR_set(dest, len); len = sv_utf8_upgrade_flags_grow(dest, SV_GMAGIC|SV_FORCE_UTF8_UPGRADE, - (send -s) * 2 + 1); + extra); d = (U8*)SvPVX(dest) + len; - /* Now process the remainder of the source, converting to - * upper and UTF-8. If a resulting byte is invariant in - * UTF-8, output it as-is, otherwise convert to UTF-8 and - * append it to the output. */ + /* Now process the remainder of the source, simultaneously + * converting to upper and UTF-8. */ for (; s < send; s++) { (void) _to_upper_title_latin1(*s, d, &len, 'S'); d += len; @@ -4270,13 +4287,15 @@ PP(pp_lc) STRLEN ulen; #ifdef USE_LOCALE_CTYPE + _toLOWER_utf8_flags(s, send, tmpbuf, &ulen, IN_LC_RUNTIME(LC_CTYPE)); #else _toLOWER_utf8_flags(s, send, tmpbuf, &ulen, 0); #endif - /* Here is where we would do context-sensitive actions. See the - * commit message for 86510fb15 for why there isn't any */ + /* Here is where we would do context-sensitive actions for the + * Greek final sigma. See the commit message for 86510fb15 for why + * there isn't any */ if (ulen > u && (SvLEN(dest) < (min += ulen - u))) { @@ -4372,7 +4391,7 @@ PP(pp_quotemeta) #ifdef USE_LOCALE_CTYPE /* In locale, we quote all non-ASCII Latin1 chars. * Otherwise use the quoting rules */ - + IN_LC_RUNTIME(LC_CTYPE) || #endif @@ -4520,52 +4539,57 @@ PP(pp_fc) #ifdef USE_LOCALE_CTYPE do_uni_folding: #endif - /* For ASCII and the Latin-1 range, there's only two troublesome - * folds, \x{DF} (\N{LATIN SMALL LETTER SHARP S}), which under full - * casefolding becomes 'ss'; and \x{B5} (\N{MICRO SIGN}), which - * under any fold becomes \x{3BC} (\N{GREEK SMALL LETTER MU}) -- + /* For ASCII and the Latin-1 range, there's two + * troublesome folds: + * \x{DF} (\N{LATIN SMALL LETTER SHARP S}), which under full + * casefolding becomes 'ss'; + * \x{B5} (\N{MICRO SIGN}), which under any fold becomes + * \x{3BC} (\N{GREEK SMALL LETTER MU}) * For the rest, the casefold is their lowercase. */ for (; s < send; d++, s++) { if (*s == MICRO_SIGN) { + Size_t extra = send - s + + variant_under_utf8_count(s, send); + /* \N{MICRO SIGN}'s casefold is \N{GREEK SMALL LETTER MU}, * which is outside of the latin-1 range. There's a couple * of ways to deal with this -- khw discusses them in * pp_lc/uc, so go there :) What we do here is upgrade what * we had already casefolded, then enter an inner loop that - * appends the rest of the characters as UTF-8. */ + * appends the rest of the characters as UTF-8. + * + * First we calculate the needed size of the upgraded dest + * beyond what's been processed already (the upgrade + * function figures that out). In UTF-8 strings, the fold case of a + * character below 256 occupies the same number of bytes as + * the original (even the Sharp S). Therefore, the space + * needed is the number of bytes remaining plus the number + * of characters that become two bytes when converted to + * UTF-8. */ + + /* Growing may move things, so have to save and recalculate + * 'd' */ len = d - (U8*)SvPVX_const(dest); SvCUR_set(dest, len); len = sv_utf8_upgrade_flags_grow(dest, SV_GMAGIC|SV_FORCE_UTF8_UPGRADE, - /* The max expansion for latin1 - * chars is 1 byte becomes 2 */ - (send -s) * 2 + 1); + extra); d = (U8*)SvPVX(dest) + len; - Copy(GREEK_SMALL_LETTER_MU_UTF8, d, small_mu_len, U8); - d += small_mu_len; + *d++ = UTF8_TWO_BYTE_HI(GREEK_SMALL_LETTER_MU); + *d++ = UTF8_TWO_BYTE_LO(GREEK_SMALL_LETTER_MU); s++; + for (; s < send; s++) { STRLEN ulen; - UV fc = _to_uni_fold_flags(*s, tmpbuf, &ulen, flags); - if UVCHR_IS_INVARIANT(fc) { - if (full_folding - && *s == LATIN_SMALL_LETTER_SHARP_S) - { - *d++ = 's'; - *d++ = 's'; - } - else - *d++ = (U8)fc; - } - else { - Copy(tmpbuf, d, ulen, U8); - d += ulen; - } + _to_uni_fold_flags(*s, d, &ulen, flags); + d += ulen; } break; } - else if (full_folding && *s == LATIN_SMALL_LETTER_SHARP_S) { + else if ( UNLIKELY(*s == LATIN_SMALL_LETTER_SHARP_S) + && full_folding) + { /* Under full casefolding, LATIN SMALL LETTER SHARP S * becomes "ss", which may require growing the SV. */ if (SvLEN(dest) < ++min) { @@ -4575,8 +4599,7 @@ PP(pp_fc) *(d)++ = 's'; *d = 's'; } - else { /* If it's not one of those two, the fold is their lower - case */ + else { /* Else, the fold is the lower case */ *d = toLOWER_LATIN1(*s); } } @@ -5387,7 +5410,7 @@ PP(pp_splice) i = -diff; while (i) dst[--i] = NULL; - + if (newlen) { Copy( tmparyval, AvARRAY(ary) + offset, newlen, SV* ); Safefree(tmparyval); @@ -5838,7 +5861,7 @@ PP(pp_split) } else { while (m < strend && !isSPACE(*m)) ++m; - } + } if (m >= strend) break; @@ -5876,7 +5899,7 @@ PP(pp_split) } else { while (s < strend && isSPACE(*s)) ++s; - } + } } } else if (RX_EXTFLAGS(rx) & RXf_START_ONLY) { @@ -6560,7 +6583,7 @@ PP(pp_lvref) } } else if (arg) { - S_localise_gv_slot(aTHX_ (GV *)arg, + S_localise_gv_slot(aTHX_ (GV *)arg, PL_op->op_private & OPpLVREF_TYPE); } else if (!(PL_op->op_private & OPpPAD_STATE)) @@ -6643,7 +6666,7 @@ PP(pp_anonconst) * for $: (OPf_STACKED ? *sp : $_[N]) * for @/%: @_[N..$#_] * - * It's equivalent to + * It's equivalent to * my $foo = $_[N]; * or * my $foo = (value-on-stack) diff --git a/proto.h b/proto.h index daf338707b..ba5623d4a2 100644 --- a/proto.h +++ b/proto.h @@ -5650,7 +5650,7 @@ PERL_CALLCONV void Perl_regprop(pTHX_ const regexp *prog, SV* sv, const regnode* #define PERL_ARGS_ASSERT_REGPROP \ assert(sv); assert(o) #endif -#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C) +#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_PP_C) PERL_CALLCONV SV* Perl__get_swash_invlist(pTHX_ SV* const swash) __attribute__warn_unused_result__; #define PERL_ARGS_ASSERT__GET_SWASH_INVLIST \ diff --git a/regcomp.c b/regcomp.c index 58cb941b06..493729256a 100644 --- a/regcomp.c +++ b/regcomp.c @@ -18815,7 +18815,7 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, } } - if (! posixl) { + if (! (anyof_flags & ANYOF_LOCALE_FLAGS)) { PERL_UINT_FAST8_T type; SV * intersection = NULL; SV* d_invlist = NULL; diff --git a/regen/unicode_constants.pl b/regen/unicode_constants.pl index 3bddd90ff8..c3fa70a6e6 100644 --- a/regen/unicode_constants.pl +++ b/regen/unicode_constants.pl @@ -240,9 +240,7 @@ __DATA__ U+017F string U+0300 string - -U+0399 string -U+03BC string +U+0307 string U+1E9E string_skip_if_undef diff --git a/t/loc_tools.pl b/t/loc_tools.pl index 7afb7bacf6..5a4379f225 100644 --- a/t/loc_tools.pl +++ b/t/loc_tools.pl @@ -501,8 +501,8 @@ sub is_locale_utf8 ($) { # Return a boolean as to if core Perl thinks the input return $ret; } -sub find_utf8_ctype_locale (;$) { # Return the name of a locale that core Perl - # thinks is a UTF-8 LC_CTYPE locale. +sub find_utf8_ctype_locales (;$) { # Return the names of the locales that core + # Perl thinks are UTF-8 LC_CTYPE locales. # Optional parameter is a reference to a # list of locales to try; if omitted, this # tries all locales it can find on the @@ -510,6 +510,7 @@ sub find_utf8_ctype_locale (;$) { # Return the name of a locale that core Perl return unless locales_enabled('LC_CTYPE'); my $locales_ref = shift; + my @return; if (! defined $locales_ref) { @@ -518,9 +519,26 @@ sub find_utf8_ctype_locale (;$) { # Return the name of a locale that core Perl } foreach my $locale (@$locales_ref) { - return $locale if is_locale_utf8($locale); + push @return, $locale if is_locale_utf8($locale); } + return @return; +} + + +sub find_utf8_ctype_locale (;$) { # Return the name of a locale that core Perl + # thinks is a UTF-8 LC_CTYPE + # locale. + # Optional parameter is a reference to a + # list of locales to try; if omitted, this + # tries all locales it can find on the + # platform + my $try_locales_ref = shift; + + my @utf8_locales = find_utf8_ctype_locales($try_locales_ref); + + return $utf8_locales[0] if @utf8_locales; + return; } diff --git a/t/op/lc.t b/t/op/lc.t index 2ce65ac73c..60b966ff9f 100644 --- a/t/op/lc.t +++ b/t/op/lc.t @@ -1,4 +1,5 @@ #!./perl +use strict; # This file is intentionally encoded in latin-1. # @@ -164,9 +165,10 @@ is(uc("\x{1C5}") , "\x{1C4}", "U+01C5 uc is U+01C4"); is(uc("\x{1C6}") , "\x{1C4}", "U+01C6 uc is U+01C4, too"); # #18107: A host of bugs involving [ul]c{,first}. AMS 20021106 -$a = "\x{3c3}foo.bar"; # \x{3c3} == GREEK SMALL LETTER SIGMA. -$b = "\x{3a3}FOO.BAR"; # \x{3a3} == GREEK CAPITAL LETTER SIGMA. +my $a = "\x{3c3}foo.bar"; # \x{3c3} == GREEK SMALL LETTER SIGMA. +my $b = "\x{3a3}FOO.BAR"; # \x{3a3} == GREEK CAPITAL LETTER SIGMA. +my $c; ($c = $b) =~ s/(\w+)/lc($1)/ge; is($c , $a, "Using s///e to change case."); @@ -310,6 +312,7 @@ constantfolding # In-place lc/uc should not corrupt string buffers when given a non-utf8- # flagged thingy that stringifies to utf8 +my %h; $h{k} = bless[], "\x{3b0}\x{3b0}\x{3b0}bcde"; # U+03B0 grows with uc() # using delete marks it as TEMP, so uc-in-place is permitted like uc delete $h{k}, qr "^(?:\x{3a5}\x{308}\x{301}){3}BCDE=ARRAY\(.*\)", diff --git a/t/re/anyof.t b/t/re/anyof.t index ad0a2d9ada..32e0bae9ad 100644 --- a/t/re/anyof.t +++ b/t/re/anyof.t @@ -462,6 +462,7 @@ my @tests = ( '(?l:[\x{212A}])' => 'ANYOFL[212A]', '(?l:[\s\x{212A}])' => 'ANYOFPOSIXL[\s][1680 2000-200A 2028-2029 202F 205F 212A 3000]', '(?l:[^\S\x{202F}])' => 'ANYOFPOSIXL[^\\S][1680 2000-200A 2028-2029 205F 3000]', + '(?li:[a-z])' => 'ANYOFL{i}[a-z{utf8 locale}A-Z\x{017F}\x{212A}]', '\p{All}' => 'SANY', '\P{All}' => 'OPFAIL', diff --git a/t/re/fold_grind.pl b/t/re/fold_grind.pl index 4082bf7e32..fa775da910 100644 --- a/t/re/fold_grind.pl +++ b/t/re/fold_grind.pl @@ -667,7 +667,11 @@ foreach my $test (sort { numerically } keys %tests) { next if $pattern_above_latin1 && ! $utf8_pattern; # Our testing of 'l' uses the POSIX locale, which is ASCII-only - my $uni_semantics = $charset ne 'l' && ($utf8_target || $charset eq 'u' || $charset eq 'L' || ($charset eq 'd' && $utf8_pattern) || $charset =~ /a/); + my $uni_semantics = $charset ne 'l' && ( $utf8_target + || $charset eq 'u' + || $charset eq 'L' + || ($charset eq 'd' && $utf8_pattern) + || $charset =~ /a/); my $upgrade_pattern = ""; $upgrade_pattern = ' utf8::upgrade($p);' if ! $pattern_above_latin1 && $utf8_pattern; diff --git a/unicode_constants.h b/unicode_constants.h index d5a410fc48..b44fed5ae9 100644 --- a/unicode_constants.h +++ b/unicode_constants.h @@ -54,9 +54,7 @@ bytes. # define LATIN_SMALL_LETTER_LONG_S_UTF8 "\xC5\xBF" /* U+017F */ # define COMBINING_GRAVE_ACCENT_UTF8 "\xCC\x80" /* U+0300 */ - -# define GREEK_CAPITAL_LETTER_IOTA_UTF8 "\xCE\x99" /* U+0399 */ -# define GREEK_SMALL_LETTER_MU_UTF8 "\xCE\xBC" /* U+03BC */ +# define COMBINING_DOT_ABOVE_UTF8 "\xCC\x87" /* U+0307 */ # define LATIN_CAPITAL_LETTER_SHARP_S_UTF8 "\xE1\xBA\x9E" /* U+1E9E */ @@ -99,9 +97,7 @@ bytes. # define LATIN_SMALL_LETTER_LONG_S_UTF8 "\x8F\x73" /* U+017F */ # define COMBINING_GRAVE_ACCENT_UTF8 "\xAF\x41" /* U+0300 */ - -# define GREEK_CAPITAL_LETTER_IOTA_UTF8 "\xB3\x68" /* U+0399 */ -# define GREEK_SMALL_LETTER_MU_UTF8 "\xB4\x70" /* U+03BC */ +# define COMBINING_DOT_ABOVE_UTF8 "\xAF\x48" /* U+0307 */ # define LATIN_CAPITAL_LETTER_SHARP_S_UTF8 "\xBF\x63\x72" /* U+1E9E */ @@ -144,9 +140,7 @@ bytes. # define LATIN_SMALL_LETTER_LONG_S_UTF8 "\x8E\x72" /* U+017F */ # define COMBINING_GRAVE_ACCENT_UTF8 "\xAD\x41" /* U+0300 */ - -# define GREEK_CAPITAL_LETTER_IOTA_UTF8 "\xB2\x67" /* U+0399 */ -# define GREEK_SMALL_LETTER_MU_UTF8 "\xB3\x6A" /* U+03BC */ +# define COMBINING_DOT_ABOVE_UTF8 "\xAD\x48" /* U+0307 */ # define LATIN_CAPITAL_LETTER_SHARP_S_UTF8 "\xBF\x62\x71" /* U+1E9E */ -- Perl5 Master Repository
