In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/613abc6d16e99bd9834fe6afd79beb61a3a4734d?hp=ea5519d61c4e7f31f98e6f49013cbdadbfa26308>
- Log ----------------------------------------------------------------- commit 613abc6d16e99bd9834fe6afd79beb61a3a4734d Author: Karl Williamson <[email protected]> Date: Mon Dec 29 13:15:57 2014 -0700 Raise warning on multi-byte char in single-byte locale See http://nntp.perl.org/group/perl.perl5.porters/211909 Something is quite likely wrong with the logic if say in a Greek locale, Unicode characters (especially Greek ones) are encountered. The same character will be represented by two different code points. This warning alerts the user to this undesirable state of affairs. M perl.h M pod/perldelta.pod M pod/perldiag.pod M pod/perllocale.pod M regexec.c M t/lib/warnings/regexec M t/lib/warnings/utf8 M t/re/charset.t M t/re/pat_advanced.t M utf8.c commit dbf3c4d788344c8d20eb2549c638ced519d3f0e8 Author: Karl Williamson <[email protected]> Date: Mon Dec 29 12:57:02 2014 -0700 perllocale: Nits M pod/perllocale.pod commit d635b7101aac73db76a54016b58991ba7cd8d778 Author: Karl Williamson <[email protected]> Date: Mon Dec 29 11:01:59 2014 -0700 foldEQ_utf8(): Add some internal flags The comments explain their purpose M utf8.c M utf8.h commit 8bdce3944e3c1dd192c971851b33f718084e1942 Author: Karl Williamson <[email protected]> Date: Fri Dec 26 19:37:50 2014 -0700 lib/warnings/utf8: Add missing tests lcfirst had been overlooked, and we had failed to test that turning off warnings actually suppresses them. M t/lib/warnings/utf8 commit 1d39b2cd2a278ed0630f07bd7598726910eb6427 Author: Karl Williamson <[email protected]> Date: Fri Dec 26 18:31:04 2014 -0700 Simplify foldEQ_utf8 This moves the uncommon case of handling inputs under non-UTF-8 locales out of this function to the functions it calls, which already have the logic to handle it. This simplifies this function, cutting a couple branches each time through the loop from the common usage. The locale handling is slowed down somewhat, but even if that were a concern, another simpler function is normally used for locale handling. This gets called only when one or both of the comparison strings is UTF-8, which should be comparatively rare for non-UTF8 locales. M utf8.c commit 357aaddece5471320c7b8b94099d29e9ee5c74fb Author: Karl Williamson <[email protected]> Date: Fri Dec 26 18:20:14 2014 -0700 utf8.c: Use OP_DESC instead of passing string. OP_DESC is simpler and more general. M embed.fnc M embed.h M proto.h M utf8.c commit e7b7ac466eda00925c5668867d967e36cdfcb731 Author: Karl Williamson <[email protected]> Date: Fri Dec 26 17:47:37 2014 -0700 utf8.c: Fix potential fold bug The function _to_uni_fold_flags() supposedly had the ability to do folding based on the current locale, if the correct flag is passed. However, it didn't actually do that, returning a non-locale fold instead. Fortunately, this is an undocumented capability (actually, the whole function is undocumented), and no current calls to it used the flag. This commit causes it to work. M utf8.c commit aa8ebe624a829b26875f96527e6ba67a796bd995 Author: Karl Williamson <[email protected]> Date: Fri Dec 26 15:41:33 2014 -0700 utf8.c: Add some function parameter assertions Currently these are not violated, but this guards against future mistakes. M utf8.c commit 01f55654e76127c670a8806c68378a393e3972f4 Author: Karl Williamson <[email protected]> Date: Thu Dec 18 22:21:21 2014 -0700 regexec.c: Move goto label to avoid redundant work This causes a goto to skip work that has already been done. M regexec.c commit 8e57b935987c3b2b9702249e3565bd39d3a05f9f Author: Karl Williamson <[email protected]> Date: Fri Dec 26 16:08:28 2014 -0700 regexec.c: White-space only M regexec.c commit 780fcc9fd03dbbd16715e2b6ecd020f9e50b7cc7 Author: Karl Williamson <[email protected]> Date: Thu Dec 18 13:29:51 2014 -0700 Don't raise 'poorly supported' locale warning unnecessarily Commit 8c6180a91de91a1194f427fc639694f43a903a78 added a warning message for when Perl determines that the program's underlying locale just switched into is poorly supported. At the time it was thought that this would be an extremely rare occurrence. However, a bug in HP-UX - B.11.00/64 causes this message to be raised for the "C" locale. A workaround was done that silenced those. However, before it got fixed, this message would occur gobs of times executing the test suite. It was raised even if the script is not locale-aware, so that the underlying locale was completely irrelevant. There is a good prospect that someone using an older Asian locale as their default would get this message inappropriately, even if they don't use locales, or switch to a supported one before using them. This commit causes the message to be raised only if it actually is relevant. When not in the scope of 'use locale', the message is stored, not raised. Upon the first locale-dependent operation within a bad locale, the saved message is raised, and the storage cleared. I was able to do this without adding extra branching to the main-line non-locale execution code. This was done by adding regnodes which get jumped to by switch statements, and refactoring some existing C tests so they exclude non-locale right off the bat. These changes would have been necessary for another locale warning that I previously agreed to implement, and which is coming a few commits from now. I do not know of any way to add tests in the test suite for this. It is in fact rare for modern locales to have these issues. The way I tested this was to temporarily change the C code so that all locales are viewed as defective, and manually note that the warnings came out where expected, and only where expected. I chose not to try to output this warning on any POSIX functions called. I believe that all that are affected are deprecated or scheduled to be deprecated anyway. And POSIX is closer to the hardware of the machine. For convenience, I also don't output the message for some zero-length pattern matches. If something is going to be matched, the message will likely very soon be raised anyway. M embedvar.h M intrpvar.h M locale.c M perl.c M perl.h M pod/perldelta.pod M pod/perldiag.pod M pp.c M regexec.c M sv.c M utf8.c commit a4525e789871d3846f20d0ea7d2d239c6a21a5a4 Author: Karl Williamson <[email protected]> Date: Thu Dec 18 10:42:30 2014 -0700 Add regex nodes for locale These will be used in a future commit to distinguish between /l patterns vs non-/l. M pod/perldebguts.pod M regcomp.c M regcomp.h M regcomp.sym M regexec.c M regnodes.h commit e7fd4aa18abbfe0099d4947060c99ca85f42f764 Author: Karl Williamson <[email protected]> Date: Thu Dec 18 14:03:09 2014 -0700 regcomp.c, regexec.c: Vertically align ternary operators For clarity M regcomp.c M regexec.c commit 2f306ab9fcefa58507af5830f60ce18c43bbad18 Author: Karl Williamson <[email protected]> Date: Thu Dec 18 13:32:21 2014 -0700 Nits in comments M regcomp.c M regcomp.sym M regexec.c M utf8.c ----------------------------------------------------------------------- Summary of changes: embed.fnc | 1 - embed.h | 2 +- embedvar.h | 1 + intrpvar.h | 1 + locale.c | 35 ++++-- perl.c | 2 + perl.h | 46 ++++++++ pod/perldebguts.pod | 6 + pod/perldelta.pod | 13 +++ pod/perldiag.pod | 34 +++++- pod/perllocale.pod | 30 +++-- pp.c | 23 ++-- proto.h | 9 +- regcomp.c | 91 ++++++++++----- regcomp.h | 10 +- regcomp.sym | 6 + regexec.c | 150 ++++++++++++++++++++---- regnodes.h | 311 ++++++++++++++++++++++++++----------------------- sv.c | 3 + t/lib/warnings/regexec | 26 +++++ t/lib/warnings/utf8 | 37 ++++++ t/re/charset.t | 2 + t/re/pat_advanced.t | 1 + utf8.c | 222 ++++++++++++++++++----------------- utf8.h | 2 + 25 files changed, 717 insertions(+), 347 deletions(-) diff --git a/embed.fnc b/embed.fnc index 52ec3ee..a5b955a 100644 --- a/embed.fnc +++ b/embed.fnc @@ -2462,7 +2462,6 @@ sn |NV|mulexp10 |NV value|I32 exponent #if defined(PERL_IN_UTF8_C) sRM |UV |check_locale_boundary_crossing \ - |NN const char * const func_name \ |NN const U8* const p \ |const UV result \ |NN U8* const ustrp \ diff --git a/embed.h b/embed.h index 49d9ee8..7895e61 100644 --- a/embed.h +++ b/embed.h @@ -1744,7 +1744,7 @@ #define isa_lookup(a,b,c,d) S_isa_lookup(aTHX_ a,b,c,d) # endif # if defined(PERL_IN_UTF8_C) -#define check_locale_boundary_crossing(a,b,c,d,e) S_check_locale_boundary_crossing(aTHX_ a,b,c,d,e) +#define check_locale_boundary_crossing(a,b,c,d) S_check_locale_boundary_crossing(aTHX_ a,b,c,d) #define is_utf8_common(a,b,c,d) S_is_utf8_common(aTHX_ a,b,c,d) #define swash_scan_list_line(a,b,c,d,e,f,g) S_swash_scan_list_line(aTHX_ a,b,c,d,e,f,g) #define swatch_get(a,b,c) S_swatch_get(aTHX_ a,b,c) diff --git a/embedvar.h b/embedvar.h index 32a8b9b..da3c331 100644 --- a/embedvar.h +++ b/embedvar.h @@ -352,6 +352,7 @@ #define PL_utf8_xidstart (vTHX->Iutf8_xidstart) #define PL_utf8cache (vTHX->Iutf8cache) #define PL_utf8locale (vTHX->Iutf8locale) +#define PL_warn_locale (vTHX->Iwarn_locale) #define PL_warnhook (vTHX->Iwarnhook) #define PL_watchaddr (vTHX->Iwatchaddr) #define PL_watchok (vTHX->Iwatchok) diff --git a/intrpvar.h b/intrpvar.h index 3bb1c9a..eb96283 100644 --- a/intrpvar.h +++ b/intrpvar.h @@ -238,6 +238,7 @@ PERLVAR(I, exit_flags, U8) /* was exit() unexpected, etc. */ PERLVAR(I, utf8locale, bool) /* utf8 locale detected */ PERLVAR(I, in_utf8_CTYPE_locale, bool) +PERLVAR(I, warn_locale, SV *) PERLVARA(I, colors,6, char *) /* values from PERL_RE_COLORS env var */ diff --git a/locale.c b/locale.c index 429fdb7..2577ed2 100644 --- a/locale.c +++ b/locale.c @@ -292,6 +292,8 @@ Perl_new_ctype(pTHX_ const char *newctype) to start */ unsigned int bad_count = 0; /* Count of bad characters */ + SvREFCNT_dec(PL_warn_locale); /* We are about to overwrite this */ + for (i = 0; i < 256; i++) { if (isUPPER_LC((U8) i)) PL_fold_locale[i] = (U8) toLOWER_LC((U8) i); @@ -360,17 +362,9 @@ Perl_new_ctype(pTHX_ const char *newctype) #endif if (bad_count || multi_byte_locale) { - - /* We have to save 'newctype' because the setlocale() just below - * may destroy it. The next setlocale() further down should - * restore it properly so that the intermediate change here is - * transparent to this function's caller */ - const char * const badlocale = savepv(newctype); - - setlocale(LC_CTYPE, "C"); - Perl_warner(aTHX_ packWARN(WARN_LOCALE), + PL_warn_locale = Perl_newSVpvf(aTHX_ "Locale '%s' may not work well.%s%s%s\n", - badlocale, + newctype, (multi_byte_locale) ? " Some characters in it are not recognized by" " Perl." @@ -384,7 +378,26 @@ Perl_new_ctype(pTHX_ const char *newctype) ? bad_chars_list : "" ); - setlocale(LC_CTYPE, badlocale); + /* If we are actually in the scope of the locale, output the + * message now. Otherwise we save it to be output at the first + * operation using this locale, if that actually happens. Most + * programs don't use locales, so they are immune to bad ones */ + if (IN_LC(LC_CTYPE)) { + + /* We have to save 'newctype' because the setlocale() just + * below may destroy it. The next setlocale() further down + * should restore it properly so that the intermediate change + * here is transparent to this function's caller */ + const char * const badlocale = savepv(newctype); + + setlocale(LC_CTYPE, "C"); + + /* The '0' below suppresses a bogus gcc compiler warning */ + Perl_warner(aTHX_ packWARN(WARN_LOCALE), SvPVX(PL_warn_locale), 0); + setlocale(LC_CTYPE, badlocale); + SvREFCNT_dec_NN(PL_warn_locale); + PL_warn_locale = NULL; + } } } diff --git a/perl.c b/perl.c index be9932d..2ebc4f7 100644 --- a/perl.c +++ b/perl.c @@ -1040,6 +1040,7 @@ perl_destruct(pTHXx) SvREFCNT_dec(PL_Latin1); SvREFCNT_dec(PL_NonL1NonFinalFold); SvREFCNT_dec(PL_HasMultiCharFold); + SvREFCNT_dec(PL_warn_locale); PL_utf8_mark = NULL; PL_utf8_toupper = NULL; PL_utf8_totitle = NULL; @@ -1051,6 +1052,7 @@ perl_destruct(pTHXx) PL_AboveLatin1 = NULL; PL_InBitmap = NULL; PL_HasMultiCharFold = NULL; + PL_warn_locale = NULL; PL_Latin1 = NULL; PL_NonL1NonFinalFold = NULL; PL_UpperLatin1 = NULL; diff --git a/perl.h b/perl.h index a3f63b0..89a7d43 100644 --- a/perl.h +++ b/perl.h @@ -5779,6 +5779,48 @@ typedef struct am_table_short AMTS; # define IN_LC(category) \ (IN_LC_COMPILETIME(category) || IN_LC_RUNTIME(category)) +# if defined (PERL_CORE) || defined (PERL_IN_XSUB_RE) + + /* This internal macro should be called from places that operate under + * locale rules. It there is a problem with the current locale that + * hasn't been raised yet, it will output a warning this time */ +# define _CHECK_AND_WARN_PROBLEMATIC_LOCALE \ + STMT_START { \ + if (PL_warn_locale) { \ + /*GCC_DIAG_IGNORE(-Wformat-security); Didn't work */ \ + Perl_ck_warner(aTHX_ packWARN(WARN_LOCALE), \ + SvPVX(PL_warn_locale), \ + 0 /* dummy to avoid comp warning */ ); \ + /* GCC_DIAG_RESTORE; */ \ + SvREFCNT_dec_NN(PL_warn_locale); \ + PL_warn_locale = NULL; \ + } \ + } STMT_END + + + /* These two internal macros are called when a warning should be raised, + * and will do so if enabled. The first takes a single code point + * argument; the 2nd, is a pointer to the first byte of the UTF-8 encoded + * string, and an end position which it won't try to read past */ +# define _CHECK_AND_OUTPUT_WIDE_LOCALE_CP_MSG(cp) \ + Perl_ck_warner(aTHX_ packWARN(WARN_LOCALE), \ + "Wide character (U+%"UVXf") in %s", (UV) cp, OP_DESC(PL_op)); + +# define _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(s, send) \ + STMT_START { /* Check if to warn before doing the conversion work */\ + if (ckWARN(WARN_LOCALE)) { \ + UV cp = utf8_to_uvchr_buf((U8 *) s, (U8 *) send, NULL); \ + Perl_warner(aTHX_ packWARN(WARN_LOCALE), \ + "Wide character (U+%"UVXf") in %s", \ + (cp == 0) \ + ? UNICODE_REPLACEMENT \ + : (UV) cp, \ + OP_DESC(PL_op)); \ + } \ + } STMT_END + +# endif /* PERL_CORE or PERL_IN_XSUB_RE */ + #else /* No locale usage */ # define IN_LOCALE_RUNTIME 0 # define IN_SOME_LOCALE_FORM_RUNTIME 0 @@ -5793,6 +5835,10 @@ typedef struct am_table_short AMTS; # define IN_LC_COMPILETIME(category) 0 # define IN_LC_RUNTIME(category) 0 # define IN_LC(category) 0 + +# define _CHECK_AND_WARN_PROBLEMATIC_LOCALE +# define _CHECK_AND_OUTPUT_WIDE_LOCALE_CP_MSG(a) +# define _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(a,b) #endif #ifdef USE_LOCALE_NUMERIC diff --git a/pod/perldebguts.pod b/pod/perldebguts.pod index 8b90342..57fa1f4 100644 --- a/pod/perldebguts.pod +++ b/pod/perldebguts.pod @@ -593,6 +593,7 @@ will be lost. CANY no Match any one byte. ANYOF sv 1 Match character in (or not in) this class, single char match only + ANYOFL sv 1 Like ANYOF, but /l is in effect # POSIX Character Classes: POSIXD none Some [[:class:]] under /d; the FLAGS field @@ -626,6 +627,7 @@ will be lost. # Literals EXACT str Match this string (preceded by length). + EXACTL str Like EXACT, but /l is in effect. EXACTF str Match this non-UTF-8 string (not guaranteed to be folded) using /id rules (w/len). EXACTFL str Match this string (not guaranteed to be @@ -635,9 +637,13 @@ will be lost. UTF-8) using /iu rules (w/len). EXACTFA str Match this string (not guaranteed to be folded) using /iaa rules (w/len). + EXACTFU_SS str Match this string (folded iff in UTF-8, length in folding may change even if not in UTF-8) using /iu rules (w/len). + EXACTFLU8 str Rare cirucmstances: like EXACTFU, but is + under /l, UTF-8, folded, and everything in + it is above 255. EXACTFA_NO_TRIE str Match this string (which is not trie-able; not guaranteed to be folded) using /iaa rules (w/len). diff --git a/pod/perldelta.pod b/pod/perldelta.pod index 6eecc00..6a830b9 100644 --- a/pod/perldelta.pod +++ b/pod/perldelta.pod @@ -209,6 +209,10 @@ XXX L<message|perldiag/"message"> XXX L<message|perldiag/"message"> +=item * + +L<Wide character (U+%X) in %s|perldiag/"Wide character (U+%X) in %s"> + =back =head2 Changes to Existing Diagnostics @@ -221,6 +225,15 @@ XXX Changes (i.e. rewording) of diagnostic messages go here XXX Describe change here +The message +L<Locale '%s' may not work well.%s|perldiag/"Locale '%s' may not work well.%s"> +is no longer raised unless the problemtatic locale is actually used in +the Perl program. Previously it was raised if it merely was the +underlying locale. All Perl programs have an underlying locale at all +times, but something like a C<S<use locale>> is needed for that locale +to actually have some effect. This message will not be raised when +the underlying locale is hidden. + =back =head1 Utility Changes diff --git a/pod/perldiag.pod b/pod/perldiag.pod index 1c845dd..4979da2 100644 --- a/pod/perldiag.pod +++ b/pod/perldiag.pod @@ -2982,16 +2982,16 @@ likely fix this error. =item Locale '%s' may not work well.%s -(W locale) The named locale that Perl is now trying to use is not fully -compatible with Perl. The second C<%s> gives a reason. +(W locale) You are using the named locale, which is a non-UTF-8 one, and +which Perl has determined is not fully compatible with Perl. The second +C<%s> gives a reason. By far the most common reason is that the locale has characters in it that are represented by more than one byte. The only such locales that Perl can handle are the UTF-8 locales. Most likely the specified locale is a non-UTF-8 one for an East Asian language such as Chinese or Japanese. If the locale is a superset of ASCII, the ASCII portion of it -may work in Perl. Read on for problems when it isn't a superset of -ASCII. +may work in Perl. Some essentially obsolete locales that aren't supersets of ASCII, mainly those in ISO 646 or other 7-bit locales, such as ASMO 449, can also have @@ -2999,6 +2999,18 @@ problems, depending on what portions of the ASCII character set get changed by the locale and are also used by the program. The warning message lists the determinable conflicting characters. +Note that not all incompatibilities are found. + +If this happens to you, there's not much you can do except switch to use a +different locale or use L<Encode> to translate from the locale into +UTF-8; if that's impracticable, you have been warned that some things +may break. + +This message is output once each time a bad locale is switched into +within the scope of C<S<use locale>>, or on the first possibly-affected +operation if the C<S<use locale>> inherits a bad one. It is not raised +for any operations from the L<POSIX> module. + =item localtime(%f) failed (W overflow) You called C<localtime> with a number that it could not handle: @@ -6952,6 +6964,20 @@ warning is to add C<no warnings 'utf8';> but that is often closer to cheating. In general, you are supposed to explicitly mark the filehandle with an encoding, see L<open> and L<perlfunc/binmode>. +=item Wide character (U+%X) in %s + +(W locale) While in a single-byte locale (I<i.e.>, a non-UTF-8 +one), a multi-byte character was encountered. Perl considers this +character to be the specified Unicode code point. Combining non-UTF8 +locales and Unicode is dangerous. Almost certainly some characters +will have two different representations. For example, in the ISO 8859-7 +(Greek) locale, the code point 0xC3 represents a Capital Gamma. But so +also does 0x393. This will make string comparisons unreliable. + +You likely need to figure out how this multi-byte character got mixed up +with your single-byte locale (or perhaps you thought you had a UTF-8 +locale, but Perl disagrees). + =item Within []-length '%c' not allowed (F) The count in the (un)pack template may be replaced by C<[TEMPLATE]> diff --git a/pod/perllocale.pod b/pod/perllocale.pod index d083c09..3b2d79d 100644 --- a/pod/perllocale.pod +++ b/pod/perllocale.pod @@ -298,8 +298,8 @@ C<ucfirst()>, and C<lcfirst()>) use C<LC_CTYPE> =item * -The variables L<$!|perlvar/$ERRNO> (and its synonyms C<$ERRNO> and -C<$OS_ERROR>) and L<$^E|perlvar/$EXTENDED_OS_ERROR> (and its synonym +B<The variables L<$!|perlvar/$ERRNO>> (and its synonyms C<$ERRNO> and +C<$OS_ERROR>) B<and L<$^E|perlvar/$EXTENDED_OS_ERROR>> (and its synonym C<$EXTENDED_OS_ERROR>) when used as strings use C<LC_MESSAGES>. =back @@ -755,7 +755,7 @@ alphabets, but where do "E<aacute>" and "E<aring>" belong? And while "color" follows "chocolate" in English, what about in traditional Spanish? The following collations all make sense and you may meet any of them -if you "use locale". +if you C<"use locale">. A B C D E a b c d e A a B b C c D d E e @@ -792,7 +792,7 @@ C<$equal_in_locale> will be true if the collation locale specifies a dictionary-like ordering that ignores space characters completely and which folds case. -Perl only supports single-byte locales for C<LC_COLLATE>. This means +Perl currently only supports single-byte locales for C<LC_COLLATE>. This means that a UTF-8 locale likely will just give you machine-native ordering. Use L<Unicode::Collate> for the full implementation of the Unicode Collation Algorithm. @@ -1005,7 +1005,7 @@ results. Here are a few possibilities: Regular expression checks for safe file names or mail addresses using C<\w> may be spoofed by an C<LC_CTYPE> locale that claims that -characters such as "E<gt>" and "|" are alphanumeric. +characters such as C<"E<gt>"> and C<"|"> are alphanumeric. =item * @@ -1466,9 +1466,12 @@ the characters in the upper half of the Latin-1 range (128 - 255) properly under C<LC_CTYPE>. To see if a character is a particular type under a locale, Perl uses the functions like C<isalnum()>. Your C library may not work for UTF-8 locales with those functions, instead -only working under the newer wide library functions like C<iswalnum()>. -However, they are treated like single-byte locales, and will have the -restrictions described below. +only working under the newer wide library functions like C<iswalnum()>, +which Perl does not use. +These multi-byte locales are treated like single-byte locales, and will +have the restrictions described below. Starting in Perl v5.22 a warning +message is raised when Perl detects a multi-byte locale that it doesn't +fully support. For single-byte locales, Perl generally takes the tack to use locale rules on code points that can fit @@ -1488,7 +1491,7 @@ Unicode, C<\p{Alpha}> will never match it, regardless of locale. A similar issue occurs with C<\N{...}>. Prior to v5.20, It is therefore a bad idea to use C<\p{}> or C<\N{}> under plain C<use locale>--I<unless> you can guarantee that the -locale will be a ISO8859-1. Use POSIX character classes instead. +locale will be ISO8859-1. Use POSIX character classes instead. Another problem with this approach is that operations that cross the single byte/multiple byte boundary are not well-defined, and so are @@ -1516,6 +1519,11 @@ Still another problem is that this approach can lead to two code points meaning the same character. Thus in a Greek locale, both U+03A7 and U+00D7 are GREEK CAPITAL LETTER CHI. +Because of all these problems, starting in v5.22, Perl will raise a +warning if a multi-byte (hence Unicode) code point is used when a +single-byte locale is in effect. (Although it doesn't check for this if +doing so would unreasonably slow execution down.) + Vendor locales are notoriously buggy, and it is difficult for Perl to test its locale-handling code because this interacts with code that Perl has no control over; therefore the locale-handling code in Perl may be buggy as @@ -1541,8 +1549,8 @@ Pre-v5.12, it was somewhat haphazard; in v5.12 it was applied fairly consistently to regular expression matching except for bracketed character classes; in v5.14 it was extended to all regex matches; and in v5.16 to the casing operations such as C<\L> and C<uc()>. For -collation, in all releases, the system's C<strxfrm()> function is called, -and whatever it does is what you get. +collation, in all releases so far, the system's C<strxfrm()> function is +called, and whatever it does is what you get. =head1 BUGS diff --git a/pp.c b/pp.c index 182fa71..08e0999 100644 --- a/pp.c +++ b/pp.c @@ -3588,23 +3588,27 @@ PP(pp_ucfirst) if (op_type == OP_LCFIRST) { /* lower case the first letter: no trickiness for any character */ - *tmpbuf = #ifdef USE_LOCALE_CTYPE - (IN_LC_RUNTIME(LC_CTYPE)) - ? toLOWER_LC(*s) - : + if (IN_LC_RUNTIME(LC_CTYPE)) { + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + *tmpbuf = toLOWER_LC(*s); + } + else #endif - (IN_UNI_8_BIT) - ? toLOWER_LATIN1(*s) - : toLOWER(*s); + { + *tmpbuf = (IN_UNI_8_BIT) + ? toLOWER_LATIN1(*s) + : toLOWER(*s); + } } - /* is ucfirst() */ #ifdef USE_LOCALE_CTYPE + /* is ucfirst() */ else if (IN_LC_RUNTIME(LC_CTYPE)) { if (IN_UTF8_CTYPE_LOCALE) { goto do_uni_rules; } + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; *tmpbuf = (U8) toUPPER_LC(*s); /* This would be a bug if any locales have upper and title case different */ @@ -3909,6 +3913,7 @@ PP(pp_uc) if (IN_UTF8_CTYPE_LOCALE) { goto do_uni_rules; } + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; for (; s < send; d++, s++) *d = (U8) toUPPER_LC(*s); } @@ -4116,6 +4121,7 @@ PP(pp_lc) * whole thing in a tight loop, for speed, */ #ifdef USE_LOCALE_CTYPE if (IN_LC_RUNTIME(LC_CTYPE)) { + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; for (; s < send; d++, s++) *d = toLOWER_LC(*s); } @@ -4298,6 +4304,7 @@ PP(pp_fc) if (IN_UTF8_CTYPE_LOCALE) { goto do_uni_folding; } + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; for (; s < send; d++, s++) *d = (U8) toFOLD_LC(*s); } diff --git a/proto.h b/proto.h index cf8e93d..ccd768f 100644 --- a/proto.h +++ b/proto.h @@ -7914,14 +7914,13 @@ STATIC bool S_isa_lookup(pTHX_ HV *stash, const char * const name, STRLEN len, U #endif #if defined(PERL_IN_UTF8_C) -STATIC UV S_check_locale_boundary_crossing(pTHX_ const char * const func_name, const U8* const p, const UV result, U8* const ustrp, STRLEN *lenp) +STATIC UV S_check_locale_boundary_crossing(pTHX_ const U8* const p, const UV result, U8* const ustrp, STRLEN *lenp) __attribute__warn_unused_result__ __attribute__nonnull__(pTHX_1) - __attribute__nonnull__(pTHX_2) - __attribute__nonnull__(pTHX_4) - __attribute__nonnull__(pTHX_5); + __attribute__nonnull__(pTHX_3) + __attribute__nonnull__(pTHX_4); #define PERL_ARGS_ASSERT_CHECK_LOCALE_BOUNDARY_CROSSING \ - assert(func_name); assert(p); assert(ustrp); assert(lenp) + assert(p); assert(ustrp); assert(lenp) PERL_STATIC_INLINE bool S_is_utf8_common(pTHX_ const U8 *const p, SV **swash, const char * const swashname, SV* const invlist) __attribute__warn_unused_result__ diff --git a/regcomp.c b/regcomp.c index a58080b..905d41b 100644 --- a/regcomp.c +++ b/regcomp.c @@ -1044,13 +1044,13 @@ S_ssc_init(pTHX_ const RExC_state_t *pRExC_state, regnode_ssc *ssc) ARG_SET(ssc, ANYOF_ONLY_HAS_BITMAP); ssc_anything(ssc); - /* If any portion of the regex is to operate under locale rules, - * initialization includes it. The reason this isn't done for all regexes - * is that the optimizer was written under the assumption that locale was - * all-or-nothing. Given the complexity and lack of documentation in the - * optimizer, and that there are inadequate test cases for locale, many - * parts of it may not work properly, it is safest to avoid locale unless - * necessary. */ + /* If any portion of the regex is to operate under locale rules that aren't + * fully known at compile time, initialization includes it. The reason + * this isn't done for all regexes is that the optimizer was written under + * the assumption that locale was all-or-nothing. Given the complexity and + * lack of documentation in the optimizer, and that there are inadequate + * test cases for locale, many parts of it may not work properly, it is + * safest to avoid locale unless necessary. */ if (RExC_contains_locale) { ANYOF_POSIXL_SETALL(ssc); } @@ -1879,7 +1879,7 @@ S_dump_trie_interim_table(pTHX_ const struct _reg_trie_data *trie, May be the same as tail. tail : item following the branch sequence count : words in the sequence - flags : currently the OP() type we will be building one of /EXACT(|F|FA|FU|FU_SS)/ + flags : currently the OP() type we will be building one of /EXACT(|F|FA|FU|FU_SS|L|FLU8)/ depth : indent depth Inplace optimizes a sequence of 2 or more Branch-Exact nodes into a TRIE node. @@ -2143,10 +2143,11 @@ S_make_trie(pTHX_ RExC_state_t *pRExC_state, regnode *startbranch, #endif switch (flags) { - case EXACT: break; + case EXACT: case EXACTL: break; case EXACTFA: case EXACTFU_SS: - case EXACTFU: folder = PL_fold_latin1; break; + case EXACTFU: + case EXACTFLU8: folder = PL_fold_latin1; break; case EXACTF: folder = PL_fold; break; default: Perl_croak( aTHX_ "panic! In trie construction, unknown node type %u %s", (unsigned) flags, PL_reg_name[flags] ); } @@ -2157,7 +2158,7 @@ S_make_trie(pTHX_ RExC_state_t *pRExC_state, regnode *startbranch, trie->wordcount = word_count; RExC_rxi->data->data[ data_slot ] = (void*)trie; trie->charmap = (U16 *) PerlMemShared_calloc( 256, sizeof(U16) ); - if (flags == EXACT) + if (flags == EXACT || flags == EXACTL) trie->bitmap = (char *) PerlMemShared_calloc( ANYOF_BITMAP_SIZE, 1 ); trie->wordinfo = (reg_trie_wordinfo *) PerlMemShared_calloc( trie->wordcount+1, sizeof(reg_trie_wordinfo)); @@ -3201,7 +3202,7 @@ S_construct_ahocorasick_from_trie(pTHX_ RExC_state_t *pRExC_state, regnode *sour StructCopy(source,op,struct regnode_charclass); stclass = (regnode *)op; } - OP(stclass)+=2; /* covert the TRIE type to its AHO-CORASICK equivalent */ + OP(stclass)+=2; /* convert the TRIE type to its AHO-CORASICK equivalent */ ARG_SET( stclass, data_slot ); aho = (reg_ac_data *) PerlMemShared_calloc( 1, sizeof(reg_ac_data) ); @@ -3500,7 +3501,7 @@ S_join_exact(pTHX_ RExC_state_t *pRExC_state, regnode *scan, * this final joining, sequences could have been split over boundaries, and * hence missed). The sequences only happen in folding, hence for any * non-EXACT EXACTish node */ - if (OP(scan) != EXACT) { + if (OP(scan) != EXACT && OP(scan) != EXACTL) { U8* s0 = (U8*) STRING(scan); U8* s = s0; U8* s_end = s0 + STR_LEN(scan); @@ -4148,14 +4149,24 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, EXACTFU | EXACTFU EXACTFU_SS | EXACTFU EXACTFA | EXACTFA + EXACTL | EXACTL + EXACTFLU8 | EXACTFLU8 */ -#define TRIE_TYPE(X) ( ( NOTHING == (X) ) ? NOTHING : \ - ( EXACT == (X) ) ? EXACT : \ - ( EXACTFU == (X) || EXACTFU_SS == (X) ) ? EXACTFU : \ - ( EXACTFA == (X) ) ? EXACTFA : \ - 0 ) +#define TRIE_TYPE(X) ( ( NOTHING == (X) ) \ + ? NOTHING \ + : ( EXACT == (X) ) \ + ? EXACT \ + : ( EXACTFU == (X) || EXACTFU_SS == (X) ) \ + ? EXACTFU \ + : ( EXACTFA == (X) ) \ + ? EXACTFA \ + : ( EXACTL == (X) ) \ + ? EXACTL \ + : ( EXACTFLU8 == (X) ) \ + ? EXACTFLU8 \ + : 0 ) /* dont use tail as the end marker for this traverse */ for ( cur = startbranch ; cur != scan ; cur = regnext( cur ) ) { @@ -4471,7 +4482,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, continue; } } - else if (OP(scan) == EXACT) { + else if (OP(scan) == EXACT || OP(scan) == EXACTL) { SSize_t l = STR_LEN(scan); UV uc; if (UTF) { @@ -4589,7 +4600,10 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, case PLUS: if (flags & (SCF_DO_SUBSTR | SCF_DO_STCLASS)) { next = NEXTOPER(scan); - if (OP(next) == EXACT || (flags & SCF_DO_STCLASS)) { + if (OP(next) == EXACT + || OP(next) == EXACTL + || (flags & SCF_DO_STCLASS)) + { mincount = 1; maxcount = REG_INFTY; next = regnext(scan); @@ -5070,6 +5084,7 @@ PerlIO_printf(Perl_debug_log, "LHS=%"UVuf" RHS=%"UVuf"\n", } break; + case ANYOFL: case ANYOF: if (flags & SCF_DO_STCLASS_AND) ssc_and(pRExC_state, data->start_class, @@ -6982,7 +6997,7 @@ reStudy: DEBUG_PEEP("first:",first,0); /* Ignore EXACT as we deal with it later. */ if (PL_regkind[OP(first)] == EXACT) { - if (OP(first) == EXACT) + if (OP(first) == EXACT || OP(first) == EXACTL) NOOP; /* Empty, get anchored substr later. */ else ri->regstclass = first; @@ -7332,7 +7347,7 @@ reStudy: && OP(regnext(first)) == END) r->extflags |= RXf_WHITE; else if ( r->extflags & RXf_SPLIT - && fop == EXACT + && (fop == EXACT || fop == EXACTL) && STR_LEN(first) == 1 && *(STRING(first)) == ' ' && OP(regnext(first)) == END ) @@ -11348,7 +11363,9 @@ S_compute_EXACTish(RExC_state_t *pRExC_state) PERL_ARGS_ASSERT_COMPUTE_EXACTISH; if (! FOLD) { - return EXACT; + return (LOC) + ? EXACTL + : EXACT; } op = get_regex_charset(RExC_flags); @@ -11446,7 +11463,9 @@ S_alloc_maybe_populate_EXACT(pTHX_ RExC_state_t *pRExC_state, for those. */ && ! _invlist_contains_cp(PL_utf8_foldable, code_point)) { - OP(node) = EXACT; + OP(node) = (LOC) + ? EXACTL + : EXACT; } } else if (code_point <= MAX_UTF8_TWO_BYTE) { @@ -12481,7 +12500,7 @@ tryagain: /* Here, are folding and are not UTF-8 encoded; therefore * the character must be in the range 0-255, and is not /l * (Not /l because we already handled these under /l in - * is_PROBLEMATIC_LOCALE_FOLD_cp */ + * is_PROBLEMATIC_LOCALE_FOLD_cp) */ if (IS_IN_SOME_FOLD_L1(ender)) { maybe_exact = FALSE; @@ -12747,10 +12766,14 @@ tryagain: * differently depending on UTF8ness of the target string * (for /u), or depending on locale for /l */ if (maybe_exact) { - OP(ret) = EXACT; + OP(ret) = (LOC) + ? EXACTL + : EXACT; } else if (maybe_exactfu) { - OP(ret) = EXACTFU; + OP(ret) = (LOC) + ? EXACTFLU8 + : EXACTFU; } } alloc_maybe_populate_EXACT(pRExC_state, ret, flagp, len, ender, @@ -13804,7 +13827,11 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, DEBUG_PARSE("clas"); /* Assume we are going to generate an ANYOF node. */ - ret = reganode(pRExC_state, ANYOF, 0); + ret = reganode(pRExC_state, + (LOC) + ? ANYOFL + : ANYOF, + 0); if (SIZE_ONLY) { RExC_size += ANYOF_SKIP; @@ -15283,7 +15310,9 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth, value = start; if (! FOLD) { - op = EXACT; + op = (LOC) + ? EXACTL + : EXACT; } else if (LOC) { @@ -15996,10 +16025,12 @@ S_regtail_study(pTHX_ RExC_state_t *pRExC_state, regnode *p, if ( exact ) { switch (OP(scan)) { case EXACT: + case EXACTL: case EXACTF: case EXACTFA_NO_TRIE: case EXACTFA: case EXACTFU: + case EXACTFLU8: case EXACTFU_SS: case EXACTFL: if( exact == PSEUDO ) @@ -16413,7 +16444,7 @@ Perl_regprop(pTHX_ const regexp *prog, SV *sv, const regnode *o, const regmatch_ SV* bitmap_invlist; /* Will hold what the bit map contains */ - if (flags & ANYOF_LOCALE_FLAGS) + if (OP(o) == ANYOFL) sv_catpvs(sv, "{loc}"); if (flags & ANYOF_LOC_FOLD) sv_catpvs(sv, "{i}"); diff --git a/regcomp.h b/regcomp.h index 049ac43..a111893 100644 --- a/regcomp.h +++ b/regcomp.h @@ -389,7 +389,15 @@ struct regnode_ssc { * probably better than that commit anyway. But it could be reinstated if we * need a bit. The LOC flags are only for /l nodes; the reverted commit was * only for /d, so there are no combinatorial issues. The LOC flag to use is - * probably the POSIXL one. + * probably the POSIXL one. Now that there is an ANYOFL (locale) node, another + * option would be to make all of those include the POSIXL data structure, + * which would get rid of needing a separate POSIXL flag. But it would + * increase the size of all such nodes, so it's probably not as atractive as + * having an ANYOF_POSIXL node type. But if we did do it, note that not all 32 + * bits of that extra space are used, one bit of that could be set aside for + * the LOC_FOLD flag, yielding yet another bit. This would require extra code + * for masking, so again not the most attractive solution. + * * Several flags are not used in synthetic start class (SSC) nodes, so could be * shared should new flags be needed for SSCs, like SSC_MATCHES_EMPTY_STRING * now. */ diff --git a/regcomp.sym b/regcomp.sym index 02b278c..c20c5aa 100644 --- a/regcomp.sym +++ b/regcomp.sym @@ -58,6 +58,7 @@ REG_ANY REG_ANY, no 0 S ; Match any one character (except newline). SANY REG_ANY, no 0 S ; Match any one character. CANY REG_ANY, no 0 S ; Match any one byte. ANYOF ANYOF, sv 1 S ; Match character in (or not in) this class, single char match only +ANYOFL ANYOF, sv 1 S ; Like ANYOF, but /l is in effect #* POSIX Character Classes: # Order of the below is important. See ordering comment above. @@ -90,11 +91,16 @@ BRANCH BRANCH, node 0 V ; Match this alternative, or the next... # NOTE: the relative ordering of these types is important do not change it EXACT EXACT, str ; Match this string (preceded by length). +EXACTL EXACT, str ; Like EXACT, but /l is in effect. EXACTF EXACT, str ; Match this non-UTF-8 string (not guaranteed to be folded) using /id rules (w/len). EXACTFL EXACT, str ; Match this string (not guaranteed to be folded) using /il rules (w/len). EXACTFU EXACT, str ; Match this string (folded iff in UTF-8, length in folding doesn't change if not in UTF-8) using /iu rules (w/len). EXACTFA EXACT, str ; Match this string (not guaranteed to be folded) using /iaa rules (w/len). + +# End of important relative ordering. + EXACTFU_SS EXACT, str ; Match this string (folded iff in UTF-8, length in folding may change even if not in UTF-8) using /iu rules (w/len). +EXACTFLU8 EXACT, str ; Rare cirucmstances: like EXACTFU, but is under /l, UTF-8, folded, and everything in it is above 255. EXACTFA_NO_TRIE EXACT, str ; Match this string (which is not trie-able; not guaranteed to be folded) using /iaa rules (w/len). #*Do nothing types diff --git a/regexec.c b/regexec.c index 1a5eb61..e659f4b 100644 --- a/regexec.c +++ b/regexec.c @@ -231,15 +231,15 @@ static const char* const non_utf8_target_but_utf8_required #if 0 /* Currently these are only used when PL_regkind[OP(rn)] == EXACT so - we don't need this definition. */ + we don't need this definition. XXX These are now out-of-sync*/ #define IS_TEXT(rn) ( OP(rn)==EXACT || OP(rn)==REF || OP(rn)==NREF ) #define IS_TEXTF(rn) ( OP(rn)==EXACTFU || OP(rn)==EXACTFU_SS || OP(rn)==EXACTFA || OP(rn)==EXACTFA_NO_TRIE || OP(rn)==EXACTF || OP(rn)==REFF || OP(rn)==NREFF ) #define IS_TEXTFL(rn) ( OP(rn)==EXACTFL || OP(rn)==REFFL || OP(rn)==NREFFL ) #else /* ... so we use this as its faster. */ -#define IS_TEXT(rn) ( OP(rn)==EXACT ) -#define IS_TEXTFU(rn) ( OP(rn)==EXACTFU || OP(rn)==EXACTFU_SS || OP(rn) == EXACTFA || OP(rn) == EXACTFA_NO_TRIE) +#define IS_TEXT(rn) ( OP(rn)==EXACT || OP(rn)==EXACTL ) +#define IS_TEXTFU(rn) ( OP(rn)==EXACTFU || OP(rn)==EXACTFLU8 || OP(rn)==EXACTFU_SS || OP(rn) == EXACTFA || OP(rn) == EXACTFA_NO_TRIE) #define IS_TEXTF(rn) ( OP(rn)==EXACTF ) #define IS_TEXTFL(rn) ( OP(rn)==EXACTFL ) @@ -498,7 +498,7 @@ S_isFOO_utf8_lc(pTHX_ const U8 classnum, const U8* character) * '_char_class_number'. * * This just calls isFOO_lc on the code point for the character if it is in - * the range 0-255. Outside that range, all characters avoid Unicode + * the range 0-255. Outside that range, all characters use Unicode * rules, ignoring any locale. So use the Unicode function if this class * requires a swash, and use the Unicode macro otherwise. */ @@ -512,6 +512,8 @@ S_isFOO_utf8_lc(pTHX_ const U8 classnum, const U8* character) TWO_BYTE_UTF8_TO_NATIVE(*character, *(character + 1))); } + _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(character, character + UTF8SKIP(character)); + if (classnum < _FIRST_NON_SWASH_CC) { /* Initialize the swash unless done already */ @@ -1433,23 +1435,39 @@ Perl_re_intuit_start(pTHX_ #define DECL_TRIE_TYPE(scan) \ - const enum { trie_plain, trie_utf8, trie_utf8_fold, trie_latin_utf8_fold, \ - trie_utf8_exactfa_fold, trie_latin_utf8_exactfa_fold } \ - trie_type = ((scan->flags == EXACT) \ - ? (utf8_target ? trie_utf8 : trie_plain) \ - : (scan->flags == EXACTFA) \ - ? (utf8_target ? trie_utf8_exactfa_fold : trie_latin_utf8_exactfa_fold) \ - : (utf8_target ? trie_utf8_fold : trie_latin_utf8_fold)) + const enum { trie_plain, trie_utf8, trie_utf8_fold, trie_latin_utf8_fold, \ + trie_utf8_exactfa_fold, trie_latin_utf8_exactfa_fold, \ + trie_utf8l, trie_flu8 } \ + trie_type = ((scan->flags == EXACT) \ + ? (utf8_target ? trie_utf8 : trie_plain) \ + : (scan->flags == EXACTL) \ + ? (utf8_target ? trie_utf8l : trie_plain) \ + : (scan->flags == EXACTFA) \ + ? (utf8_target \ + ? trie_utf8_exactfa_fold \ + : trie_latin_utf8_exactfa_fold) \ + : (scan->flags == EXACTFLU8 \ + ? trie_flu8 \ + : (utf8_target \ + ? trie_utf8_fold \ + : trie_latin_utf8_fold))) #define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \ STMT_START { \ STRLEN skiplen; \ U8 flags = FOLD_FLAGS_FULL; \ switch (trie_type) { \ + case trie_flu8: \ + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; \ + if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) { \ + _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc + UTF8SKIP(uc)); \ + } \ + goto do_trie_utf8_fold; \ case trie_utf8_exactfa_fold: \ flags |= FOLD_FLAGS_NOMIX_ASCII; \ - /* FALLTHROUGH */ \ + /* FALLTHROUGH */ \ case trie_utf8_fold: \ + do_trie_utf8_fold: \ if ( foldlen>0 ) { \ uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \ foldlen -= len; \ @@ -1465,7 +1483,7 @@ STMT_START { break; \ case trie_latin_utf8_exactfa_fold: \ flags |= FOLD_FLAGS_NOMIX_ASCII; \ - /* FALLTHROUGH */ \ + /* FALLTHROUGH */ \ case trie_latin_utf8_fold: \ if ( foldlen>0 ) { \ uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \ @@ -1480,6 +1498,12 @@ STMT_START { uscan = foldbuf + skiplen; \ } \ break; \ + case trie_utf8l: \ + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; \ + if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) { \ + _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc + UTF8SKIP(uc)); \ + } \ + /* FALLTHROUGH */ \ case trie_utf8: \ uvc = utf8n_to_uvchr( (const U8*) uc, UTF8_MAXLEN, &len, uniflags ); \ break; \ @@ -1739,6 +1763,9 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s, /* We know what class it must start with. */ switch (OP(c)) { + case ANYOFL: + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + /* FALLTHROUGH */ case ANYOF: if (utf8_target) { REXEC_FBC_UTF8_CLASS_SCAN( @@ -1780,6 +1807,7 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s, goto do_exactf_non_utf8; case EXACTFL: + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; if (is_utf8_pat || utf8_target || IN_UTF8_CTYPE_LOCALE) { utf8_fold_flags = FOLDEQ_LOCALE; goto do_exactf_utf8; @@ -1794,6 +1822,15 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s, } goto do_exactf_utf8; + case EXACTFLU8: + if (! utf8_target) { /* All code points in this node require + UTF-8 to express. */ + break; + } + utf8_fold_flags = FOLDEQ_LOCALE | FOLDEQ_S2_ALREADY_FOLDED + | FOLDEQ_S2_FOLDS_SANE; + goto do_exactf_utf8; + case EXACTFU: if (is_utf8_pat || utf8_target) { utf8_fold_flags = is_utf8_pat ? FOLDEQ_S2_ALREADY_FOLDED : 0; @@ -1899,9 +1936,11 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s, } case BOUNDL: + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; FBC_BOUND(isWORDCHAR_LC, isWORDCHAR_LC_uvchr, isWORDCHAR_LC_utf8); break; case NBOUNDL: + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; FBC_NBOUND(isWORDCHAR_LC, isWORDCHAR_LC_uvchr, isWORDCHAR_LC_utf8); break; case BOUND: @@ -1936,6 +1975,7 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s, /* FALLTHROUGH */ case POSIXL: + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; REXEC_FBC_CSCAN(to_complement ^ cBOOL(isFOO_utf8_lc(FLAGS(c), (U8 *) s)), to_complement ^ cBOOL(isFOO_lc(FLAGS(c), *s))); break; @@ -3648,7 +3688,7 @@ S_setup_EXACTISH_ST_c1_c2(pTHX_ const regnode * const text_node, int *c1p, U8 *pat = (U8*)STRING(text_node); U8 folded[UTF8_MAX_FOLD_CHAR_EXPAND * UTF8_MAXBYTES_CASE + 1] = { '\0' }; - if (OP(text_node) == EXACT) { + if (OP(text_node) == EXACT || OP(text_node) == EXACTL) { /* In an exact node, only one thing can be matched, that first * character. If both the pat and the target are UTF-8, we can just @@ -4152,6 +4192,19 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]); U32 state = trie->startstate; + if (scan->flags == EXACTL || scan->flags == EXACTFLU8) { + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + if (utf8_target + && UTF8_IS_ABOVE_LATIN1(nextchr) + && scan->flags == EXACTL) + { + /* We only output for EXACTL, as we let the folder + * output this message for EXACTFLU8 to avoid + * duplication */ + _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(locinput, + reginfo->strend); + } + } if ( trie->bitmap && (NEXTCHR_IS_EOS || !TRIE_BITMAP_TEST(trie, nextchr))) { @@ -4425,6 +4478,19 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) } #undef ST + case EXACTL: /* /abc/l */ + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + + /* Complete checking would involve going through every character + * matched by the string to see if any is above latin1. But the + * comparision otherwise might very well be a fast assembly + * language routine, and I (khw) don't think slowing things down + * just to check for this warning is worth it. So this just checks + * the first character */ + if (utf8_target && UTF8_IS_ABOVE_LATIN1(*locinput)) { + _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(locinput, reginfo->strend); + } + /* FALLTHROUGH */ case EXACT: { /* /abc/ */ char *s = STRING(scan); ln = STR_LEN(scan); @@ -4511,11 +4577,22 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) const char * s; U32 fold_utf8_flags; + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; folder = foldEQ_locale; fold_array = PL_fold_locale; fold_utf8_flags = FOLDEQ_LOCALE; goto do_exactf; + case EXACTFLU8: /* /abc/il; but all 'abc' are above 255, so + is effectively /u; hence to match, target + must be UTF-8. */ + if (! utf8_target) { + sayNO; + } + fold_utf8_flags = FOLDEQ_LOCALE | FOLDEQ_S1_ALREADY_FOLDED + | FOLDEQ_S1_FOLDS_SANE; + goto do_exactf; + case EXACTFU_SS: /* /\x{df}/iu */ case EXACTFU: /* /abc/iu */ folder = foldEQ_latin1; @@ -4583,6 +4660,8 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) * have to set the FLAGS fields of these */ case BOUNDL: /* /\b/l */ case NBOUNDL: /* /\B/l */ + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + /* FALLTHROUGH */ case BOUND: /* /\b/ */ case BOUNDU: /* /\b/u */ case BOUNDA: /* /\b/a */ @@ -4661,7 +4740,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) sayNO; break; - case ANYOF: /* /[abc]/ */ + case ANYOFL: /* /[abc]/l */ + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + /* FALLTHROUGH */ + case ANYOF: /* /[abc]/ */ if (NEXTCHR_IS_EOS) sayNO; if (utf8_target) { @@ -4685,6 +4767,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) /* FALLTHROUGH */ case POSIXL: /* \w or [:punct:] etc. under /l */ + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; if (NEXTCHR_IS_EOS) sayNO; @@ -4705,7 +4788,8 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) } } else { /* Here, must be an above Latin-1 code point */ - goto utf8_posix_not_eos; + _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(locinput, reginfo->strend); + goto utf8_posix_above_latin1; } /* Here, must be utf8 */ @@ -4764,7 +4848,6 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) if (NEXTCHR_IS_EOS) { sayNO; } - utf8_posix_not_eos: /* Use _generic_isCC() for characters within Latin1. (Note that * UTF8_IS_INVARIANT works even on non-UTF-8 strings, or else @@ -4788,6 +4871,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) locinput += 2; } else { /* Handle above Latin-1 code points */ + utf8_posix_above_latin1: classnum = (_char_class_number) FLAGS(scan); if (classnum < _FIRST_NON_SWASH_CC) { @@ -5061,6 +5145,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) const U8 *fold_array; UV utf8_fold_flags; + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; folder = foldEQ_locale; fold_array = PL_fold_locale; type = REFFL; @@ -5105,6 +5190,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) goto do_nref_ref_common; case REFFL: /* /\1/il */ + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; folder = foldEQ_locale; fold_array = PL_fold_locale; utf8_fold_flags = FOLDEQ_LOCALE; @@ -7174,6 +7260,12 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p, scan = loceol; } break; + case EXACTL: + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + if (utf8_target && UTF8_IS_ABOVE_LATIN1(*scan)) { + _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(scan, loceol); + } + /* FALLTHROUGH */ case EXACT: assert(STR_LEN(p) == reginfo->is_utf8_pat ? UTF8SKIP(STRING(p)) : 1); @@ -7247,6 +7339,7 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p, goto do_exactf; case EXACTFL: + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; utf8_flags = FOLDEQ_LOCALE; goto do_exactf; @@ -7255,6 +7348,14 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p, utf8_flags = 0; goto do_exactf; + case EXACTFLU8: + if (! utf8_target) { + break; + } + utf8_flags = FOLDEQ_LOCALE | FOLDEQ_S2_ALREADY_FOLDED + | FOLDEQ_S2_FOLDS_SANE; + goto do_exactf; + case EXACTFU_SS: case EXACTFU: utf8_flags = reginfo->is_utf8_pat ? FOLDEQ_S2_ALREADY_FOLDED : 0; @@ -7318,6 +7419,9 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p, } break; } + case ANYOFL: + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + /* FALLTHROUGH */ case ANYOF: if (utf8_target) { while (hardcount < max @@ -7340,6 +7444,7 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p, /* FALLTHROUGH */ case POSIXL: + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; if (! utf8_target) { while (scan < loceol && to_complement ^ cBOOL(isFOO_lc(FLAGS(p), *scan))) @@ -7559,16 +7664,18 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p, } break; + case BOUNDL: + case NBOUNDL: + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + /* FALLTHROUGH */ case BOUND: case BOUNDA: - case BOUNDL: case BOUNDU: case EOS: case GPOS: case KEEPS: case NBOUND: case NBOUNDA: - case NBOUNDL: case NBOUNDU: case OPFAIL: case SBOL: @@ -7627,7 +7734,7 @@ Perl_regclass_swash(pTHX_ const regexp *prog, const regnode* node, bool doinit, /* - reginclass - determine if a character falls into a character class - n is the ANYOF regnode + n is the ANYOF-type regnode p is the target string p_end points to one byte beyond the end of the target string utf8_target tells whether p is in UTF-8. @@ -7661,6 +7768,9 @@ S_reginclass(pTHX_ regexp * const prog, const regnode * const n, const U8* const * UTF8_ALLOW_FFFF */ if (c_len == (STRLEN)-1) Perl_croak(aTHX_ "Malformed UTF-8 character (fatal)"); + if (c > 255 && OP(n) == ANYOFL && ! is_ANYOF_SYNTHETIC(n)) { + _CHECK_AND_OUTPUT_WIDE_LOCALE_CP_MSG(c); + } } /* If this character is potentially in the bitmap, check it */ diff --git a/regnodes.h b/regnodes.h index 41662a0..94616a6 100644 --- a/regnodes.h +++ b/regnodes.h @@ -6,8 +6,8 @@ /* Regops and State definitions */ -#define REGNODE_MAX 90 -#define REGMATCH_STATE_MAX 130 +#define REGNODE_MAX 93 +#define REGMATCH_STATE_MAX 133 #define END 0 /* 0000 End of program. */ #define SUCCEED 1 /* 0x01 Return from a subroutine, basically. */ @@ -31,77 +31,80 @@ #define SANY 17 /* 0x11 Match any one character. */ #define CANY 18 /* 0x12 Match any one byte. */ #define ANYOF 19 /* 0x13 Match character in (or not in) this class, single char match only */ -#define POSIXD 20 /* 0x14 Some [[:class:]] under /d; the FLAGS field gives which one */ -#define POSIXL 21 /* 0x15 Some [[:class:]] under /l; the FLAGS field gives which one */ -#define POSIXU 22 /* 0x16 Some [[:class:]] under /u; the FLAGS field gives which one */ -#define POSIXA 23 /* 0x17 Some [[:class:]] under /a; the FLAGS field gives which one */ -#define NPOSIXD 24 /* 0x18 complement of POSIXD, [[:^class:]] */ -#define NPOSIXL 25 /* 0x19 complement of POSIXL, [[:^class:]] */ -#define NPOSIXU 26 /* 0x1a complement of POSIXU, [[:^class:]] */ -#define NPOSIXA 27 /* 0x1b complement of POSIXA, [[:^class:]] */ -#define CLUMP 28 /* 0x1c Match any extended grapheme cluster sequence */ -#define BRANCH 29 /* 0x1d Match this alternative, or the next... */ -#define EXACT 30 /* 0x1e Match this string (preceded by length). */ -#define EXACTF 31 /* 0x1f Match this non-UTF-8 string (not guaranteed to be folded) using /id rules (w/len). */ -#define EXACTFL 32 /* 0x20 Match this string (not guaranteed to be folded) using /il rules (w/len). */ -#define EXACTFU 33 /* 0x21 Match this string (folded iff in UTF-8, length in folding doesn't change if not in UTF-8) using /iu rules (w/len). */ -#define EXACTFA 34 /* 0x22 Match this string (not guaranteed to be folded) using /iaa rules (w/len). */ -#define EXACTFU_SS 35 /* 0x23 Match this string (folded iff in UTF-8, length in folding may change even if not in UTF-8) using /iu rules (w/len). */ -#define EXACTFA_NO_TRIE 36 /* 0x24 Match this string (which is not trie-able; not guaranteed to be folded) using /iaa rules (w/len). */ -#define NOTHING 37 /* 0x25 Match empty string. */ -#define TAIL 38 /* 0x26 Match empty string. Can jump here from outside. */ -#define STAR 39 /* 0x27 Match this (simple) thing 0 or more times. */ -#define PLUS 40 /* 0x28 Match this (simple) thing 1 or more times. */ -#define CURLY 41 /* 0x29 Match this simple thing {n,m} times. */ -#define CURLYN 42 /* 0x2a Capture next-after-this simple thing */ -#define CURLYM 43 /* 0x2b Capture this medium-complex thing {n,m} times. */ -#define CURLYX 44 /* 0x2c Match this complex thing {n,m} times. */ -#define WHILEM 45 /* 0x2d Do curly processing and see if rest matches. */ -#define OPEN 46 /* 0x2e Mark this point in input as start of #n. */ -#define CLOSE 47 /* 0x2f Analogous to OPEN. */ -#define REF 48 /* 0x30 Match some already matched string */ -#define REFF 49 /* 0x31 Match already matched string, folded using native charset rules for non-utf8 */ -#define REFFL 50 /* 0x32 Match already matched string, folded in loc. */ -#define REFFU 51 /* 0x33 Match already matched string, folded using unicode rules for non-utf8 */ -#define REFFA 52 /* 0x34 Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII */ -#define NREF 53 /* 0x35 Match some already matched string */ -#define NREFF 54 /* 0x36 Match already matched string, folded using native charset rules for non-utf8 */ -#define NREFFL 55 /* 0x37 Match already matched string, folded in loc. */ -#define NREFFU 56 /* 0x38 Match already matched string, folded using unicode rules for non-utf8 */ -#define NREFFA 57 /* 0x39 Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII */ -#define LONGJMP 58 /* 0x3a Jump far away. */ -#define BRANCHJ 59 /* 0x3b BRANCH with long offset. */ -#define IFMATCH 60 /* 0x3c Succeeds if the following matches. */ -#define UNLESSM 61 /* 0x3d Fails if the following matches. */ -#define SUSPEND 62 /* 0x3e "Independent" sub-RE. */ -#define IFTHEN 63 /* 0x3f Switch, should be preceded by switcher. */ -#define GROUPP 64 /* 0x40 Whether the group matched. */ -#define EVAL 65 /* 0x41 Execute some Perl code. */ -#define MINMOD 66 /* 0x42 Next operator is not greedy. */ -#define LOGICAL 67 /* 0x43 Next opcode should set the flag only. */ -#define RENUM 68 /* 0x44 Group with independently numbered parens. */ -#define TRIE 69 /* 0x45 Match many EXACT(F[ALU]?)? at once. flags==type */ -#define TRIEC 70 /* 0x46 Same as TRIE, but with embedded charclass data */ -#define AHOCORASICK 71 /* 0x47 Aho Corasick stclass. flags==type */ -#define AHOCORASICKC 72 /* 0x48 Same as AHOCORASICK, but with embedded charclass data */ -#define GOSUB 73 /* 0x49 recurse to paren arg1 at (signed) ofs arg2 */ -#define GOSTART 74 /* 0x4a recurse to start of pattern */ -#define NGROUPP 75 /* 0x4b Whether the group matched. */ -#define INSUBP 76 /* 0x4c Whether we are in a specific recurse. */ -#define DEFINEP 77 /* 0x4d Never execute directly. */ -#define ENDLIKE 78 /* 0x4e Used only for the type field of verbs */ -#define OPFAIL 79 /* 0x4f Same as (?!) */ -#define ACCEPT 80 /* 0x50 Accepts the current matched string. */ -#define VERB 81 /* 0x51 Used only for the type field of verbs */ -#define PRUNE 82 /* 0x52 Pattern fails at this startpoint if no-backtracking through this */ -#define MARKPOINT 83 /* 0x53 Push the current location for rollback by cut. */ -#define SKIP 84 /* 0x54 On failure skip forward (to the mark) before retrying */ -#define COMMIT 85 /* 0x55 Pattern fails outright if backtracking through this */ -#define CUTGROUP 86 /* 0x56 On failure go to the next alternation in the group */ -#define KEEPS 87 /* 0x57 $& begins here. */ -#define LNBREAK 88 /* 0x58 generic newline pattern */ -#define OPTIMIZED 89 /* 0x59 Placeholder for dump. */ -#define PSEUDO 90 /* 0x5a Pseudo opcode for internal use. */ +#define ANYOFL 20 /* 0x14 Like ANYOF, but /l is in effect */ +#define POSIXD 21 /* 0x15 Some [[:class:]] under /d; the FLAGS field gives which one */ +#define POSIXL 22 /* 0x16 Some [[:class:]] under /l; the FLAGS field gives which one */ +#define POSIXU 23 /* 0x17 Some [[:class:]] under /u; the FLAGS field gives which one */ +#define POSIXA 24 /* 0x18 Some [[:class:]] under /a; the FLAGS field gives which one */ +#define NPOSIXD 25 /* 0x19 complement of POSIXD, [[:^class:]] */ +#define NPOSIXL 26 /* 0x1a complement of POSIXL, [[:^class:]] */ +#define NPOSIXU 27 /* 0x1b complement of POSIXU, [[:^class:]] */ +#define NPOSIXA 28 /* 0x1c complement of POSIXA, [[:^class:]] */ +#define CLUMP 29 /* 0x1d Match any extended grapheme cluster sequence */ +#define BRANCH 30 /* 0x1e Match this alternative, or the next... */ +#define EXACT 31 /* 0x1f Match this string (preceded by length). */ +#define EXACTL 32 /* 0x20 Like EXACT, but /l is in effect. */ +#define EXACTF 33 /* 0x21 Match this non-UTF-8 string (not guaranteed to be folded) using /id rules (w/len). */ +#define EXACTFL 34 /* 0x22 Match this string (not guaranteed to be folded) using /il rules (w/len). */ +#define EXACTFU 35 /* 0x23 Match this string (folded iff in UTF-8, length in folding doesn't change if not in UTF-8) using /iu rules (w/len). */ +#define EXACTFA 36 /* 0x24 Match this string (not guaranteed to be folded) using /iaa rules (w/len). */ +#define EXACTFU_SS 37 /* 0x25 Match this string (folded iff in UTF-8, length in folding may change even if not in UTF-8) using /iu rules (w/len). */ +#define EXACTFLU8 38 /* 0x26 Rare cirucmstances: like EXACTFU, but is under /l, UTF-8, folded, and everything in it is above 255. */ +#define EXACTFA_NO_TRIE 39 /* 0x27 Match this string (which is not trie-able; not guaranteed to be folded) using /iaa rules (w/len). */ +#define NOTHING 40 /* 0x28 Match empty string. */ +#define TAIL 41 /* 0x29 Match empty string. Can jump here from outside. */ +#define STAR 42 /* 0x2a Match this (simple) thing 0 or more times. */ +#define PLUS 43 /* 0x2b Match this (simple) thing 1 or more times. */ +#define CURLY 44 /* 0x2c Match this simple thing {n,m} times. */ +#define CURLYN 45 /* 0x2d Capture next-after-this simple thing */ +#define CURLYM 46 /* 0x2e Capture this medium-complex thing {n,m} times. */ +#define CURLYX 47 /* 0x2f Match this complex thing {n,m} times. */ +#define WHILEM 48 /* 0x30 Do curly processing and see if rest matches. */ +#define OPEN 49 /* 0x31 Mark this point in input as start of #n. */ +#define CLOSE 50 /* 0x32 Analogous to OPEN. */ +#define REF 51 /* 0x33 Match some already matched string */ +#define REFF 52 /* 0x34 Match already matched string, folded using native charset rules for non-utf8 */ +#define REFFL 53 /* 0x35 Match already matched string, folded in loc. */ +#define REFFU 54 /* 0x36 Match already matched string, folded using unicode rules for non-utf8 */ +#define REFFA 55 /* 0x37 Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII */ +#define NREF 56 /* 0x38 Match some already matched string */ +#define NREFF 57 /* 0x39 Match already matched string, folded using native charset rules for non-utf8 */ +#define NREFFL 58 /* 0x3a Match already matched string, folded in loc. */ +#define NREFFU 59 /* 0x3b Match already matched string, folded using unicode rules for non-utf8 */ +#define NREFFA 60 /* 0x3c Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII */ +#define LONGJMP 61 /* 0x3d Jump far away. */ +#define BRANCHJ 62 /* 0x3e BRANCH with long offset. */ +#define IFMATCH 63 /* 0x3f Succeeds if the following matches. */ +#define UNLESSM 64 /* 0x40 Fails if the following matches. */ +#define SUSPEND 65 /* 0x41 "Independent" sub-RE. */ +#define IFTHEN 66 /* 0x42 Switch, should be preceded by switcher. */ +#define GROUPP 67 /* 0x43 Whether the group matched. */ +#define EVAL 68 /* 0x44 Execute some Perl code. */ +#define MINMOD 69 /* 0x45 Next operator is not greedy. */ +#define LOGICAL 70 /* 0x46 Next opcode should set the flag only. */ +#define RENUM 71 /* 0x47 Group with independently numbered parens. */ +#define TRIE 72 /* 0x48 Match many EXACT(F[ALU]?)? at once. flags==type */ +#define TRIEC 73 /* 0x49 Same as TRIE, but with embedded charclass data */ +#define AHOCORASICK 74 /* 0x4a Aho Corasick stclass. flags==type */ +#define AHOCORASICKC 75 /* 0x4b Same as AHOCORASICK, but with embedded charclass data */ +#define GOSUB 76 /* 0x4c recurse to paren arg1 at (signed) ofs arg2 */ +#define GOSTART 77 /* 0x4d recurse to start of pattern */ +#define NGROUPP 78 /* 0x4e Whether the group matched. */ +#define INSUBP 79 /* 0x4f Whether we are in a specific recurse. */ +#define DEFINEP 80 /* 0x50 Never execute directly. */ +#define ENDLIKE 81 /* 0x51 Used only for the type field of verbs */ +#define OPFAIL 82 /* 0x52 Same as (?!) */ +#define ACCEPT 83 /* 0x53 Accepts the current matched string. */ +#define VERB 84 /* 0x54 Used only for the type field of verbs */ +#define PRUNE 85 /* 0x55 Pattern fails at this startpoint if no-backtracking through this */ +#define MARKPOINT 86 /* 0x56 Push the current location for rollback by cut. */ +#define SKIP 87 /* 0x57 On failure skip forward (to the mark) before retrying */ +#define COMMIT 88 /* 0x58 Pattern fails outright if backtracking through this */ +#define CUTGROUP 89 /* 0x59 On failure go to the next alternation in the group */ +#define KEEPS 90 /* 0x5a $& begins here. */ +#define LNBREAK 91 /* 0x5b generic newline pattern */ +#define OPTIMIZED 92 /* 0x5c Placeholder for dump. */ +#define PSEUDO 93 /* 0x5d Pseudo opcode for internal use. */ /* ------------ States ------------- */ #define TRIE_next (REGNODE_MAX + 1) /* state for TRIE */ #define TRIE_next_fail (REGNODE_MAX + 2) /* state for TRIE */ @@ -170,6 +173,7 @@ EXTCONST U8 PL_regkind[] = { REG_ANY, /* SANY */ REG_ANY, /* CANY */ ANYOF, /* ANYOF */ + ANYOF, /* ANYOFL */ POSIXD, /* POSIXD */ POSIXD, /* POSIXL */ POSIXD, /* POSIXU */ @@ -181,11 +185,13 @@ EXTCONST U8 PL_regkind[] = { CLUMP, /* CLUMP */ BRANCH, /* BRANCH */ EXACT, /* EXACT */ + EXACT, /* EXACTL */ EXACT, /* EXACTF */ EXACT, /* EXACTFL */ EXACT, /* EXACTFU */ EXACT, /* EXACTFA */ EXACT, /* EXACTFU_SS */ + EXACT, /* EXACTFLU8 */ EXACT, /* EXACTFA_NO_TRIE */ NOTHING, /* NOTHING */ NOTHING, /* TAIL */ @@ -309,6 +315,7 @@ static const U8 regarglen[] = { 0, /* SANY */ 0, /* CANY */ EXTRA_SIZE(struct regnode_1), /* ANYOF */ + EXTRA_SIZE(struct regnode_1), /* ANYOFL */ 0, /* POSIXD */ 0, /* POSIXL */ 0, /* POSIXU */ @@ -320,11 +327,13 @@ static const U8 regarglen[] = { 0, /* CLUMP */ 0, /* BRANCH */ 0, /* EXACT */ + 0, /* EXACTL */ 0, /* EXACTF */ 0, /* EXACTFL */ 0, /* EXACTFU */ 0, /* EXACTFA */ 0, /* EXACTFU_SS */ + 0, /* EXACTFLU8 */ 0, /* EXACTFA_NO_TRIE */ 0, /* NOTHING */ 0, /* TAIL */ @@ -405,6 +414,7 @@ static const char reg_off_by_arg[] = { 0, /* SANY */ 0, /* CANY */ 0, /* ANYOF */ + 0, /* ANYOFL */ 0, /* POSIXD */ 0, /* POSIXL */ 0, /* POSIXU */ @@ -416,11 +426,13 @@ static const char reg_off_by_arg[] = { 0, /* CLUMP */ 0, /* BRANCH */ 0, /* EXACT */ + 0, /* EXACTL */ 0, /* EXACTF */ 0, /* EXACTFL */ 0, /* EXACTFU */ 0, /* EXACTFA */ 0, /* EXACTFU_SS */ + 0, /* EXACTFLU8 */ 0, /* EXACTFA_NO_TRIE */ 0, /* NOTHING */ 0, /* TAIL */ @@ -506,77 +518,80 @@ EXTCONST char * const PL_reg_name[] = { "SANY", /* 0x11 */ "CANY", /* 0x12 */ "ANYOF", /* 0x13 */ - "POSIXD", /* 0x14 */ - "POSIXL", /* 0x15 */ - "POSIXU", /* 0x16 */ - "POSIXA", /* 0x17 */ - "NPOSIXD", /* 0x18 */ - "NPOSIXL", /* 0x19 */ - "NPOSIXU", /* 0x1a */ - "NPOSIXA", /* 0x1b */ - "CLUMP", /* 0x1c */ - "BRANCH", /* 0x1d */ - "EXACT", /* 0x1e */ - "EXACTF", /* 0x1f */ - "EXACTFL", /* 0x20 */ - "EXACTFU", /* 0x21 */ - "EXACTFA", /* 0x22 */ - "EXACTFU_SS", /* 0x23 */ - "EXACTFA_NO_TRIE", /* 0x24 */ - "NOTHING", /* 0x25 */ - "TAIL", /* 0x26 */ - "STAR", /* 0x27 */ - "PLUS", /* 0x28 */ - "CURLY", /* 0x29 */ - "CURLYN", /* 0x2a */ - "CURLYM", /* 0x2b */ - "CURLYX", /* 0x2c */ - "WHILEM", /* 0x2d */ - "OPEN", /* 0x2e */ - "CLOSE", /* 0x2f */ - "REF", /* 0x30 */ - "REFF", /* 0x31 */ - "REFFL", /* 0x32 */ - "REFFU", /* 0x33 */ - "REFFA", /* 0x34 */ - "NREF", /* 0x35 */ - "NREFF", /* 0x36 */ - "NREFFL", /* 0x37 */ - "NREFFU", /* 0x38 */ - "NREFFA", /* 0x39 */ - "LONGJMP", /* 0x3a */ - "BRANCHJ", /* 0x3b */ - "IFMATCH", /* 0x3c */ - "UNLESSM", /* 0x3d */ - "SUSPEND", /* 0x3e */ - "IFTHEN", /* 0x3f */ - "GROUPP", /* 0x40 */ - "EVAL", /* 0x41 */ - "MINMOD", /* 0x42 */ - "LOGICAL", /* 0x43 */ - "RENUM", /* 0x44 */ - "TRIE", /* 0x45 */ - "TRIEC", /* 0x46 */ - "AHOCORASICK", /* 0x47 */ - "AHOCORASICKC", /* 0x48 */ - "GOSUB", /* 0x49 */ - "GOSTART", /* 0x4a */ - "NGROUPP", /* 0x4b */ - "INSUBP", /* 0x4c */ - "DEFINEP", /* 0x4d */ - "ENDLIKE", /* 0x4e */ - "OPFAIL", /* 0x4f */ - "ACCEPT", /* 0x50 */ - "VERB", /* 0x51 */ - "PRUNE", /* 0x52 */ - "MARKPOINT", /* 0x53 */ - "SKIP", /* 0x54 */ - "COMMIT", /* 0x55 */ - "CUTGROUP", /* 0x56 */ - "KEEPS", /* 0x57 */ - "LNBREAK", /* 0x58 */ - "OPTIMIZED", /* 0x59 */ - "PSEUDO", /* 0x5a */ + "ANYOFL", /* 0x14 */ + "POSIXD", /* 0x15 */ + "POSIXL", /* 0x16 */ + "POSIXU", /* 0x17 */ + "POSIXA", /* 0x18 */ + "NPOSIXD", /* 0x19 */ + "NPOSIXL", /* 0x1a */ + "NPOSIXU", /* 0x1b */ + "NPOSIXA", /* 0x1c */ + "CLUMP", /* 0x1d */ + "BRANCH", /* 0x1e */ + "EXACT", /* 0x1f */ + "EXACTL", /* 0x20 */ + "EXACTF", /* 0x21 */ + "EXACTFL", /* 0x22 */ + "EXACTFU", /* 0x23 */ + "EXACTFA", /* 0x24 */ + "EXACTFU_SS", /* 0x25 */ + "EXACTFLU8", /* 0x26 */ + "EXACTFA_NO_TRIE", /* 0x27 */ + "NOTHING", /* 0x28 */ + "TAIL", /* 0x29 */ + "STAR", /* 0x2a */ + "PLUS", /* 0x2b */ + "CURLY", /* 0x2c */ + "CURLYN", /* 0x2d */ + "CURLYM", /* 0x2e */ + "CURLYX", /* 0x2f */ + "WHILEM", /* 0x30 */ + "OPEN", /* 0x31 */ + "CLOSE", /* 0x32 */ + "REF", /* 0x33 */ + "REFF", /* 0x34 */ + "REFFL", /* 0x35 */ + "REFFU", /* 0x36 */ + "REFFA", /* 0x37 */ + "NREF", /* 0x38 */ + "NREFF", /* 0x39 */ + "NREFFL", /* 0x3a */ + "NREFFU", /* 0x3b */ + "NREFFA", /* 0x3c */ + "LONGJMP", /* 0x3d */ + "BRANCHJ", /* 0x3e */ + "IFMATCH", /* 0x3f */ + "UNLESSM", /* 0x40 */ + "SUSPEND", /* 0x41 */ + "IFTHEN", /* 0x42 */ + "GROUPP", /* 0x43 */ + "EVAL", /* 0x44 */ + "MINMOD", /* 0x45 */ + "LOGICAL", /* 0x46 */ + "RENUM", /* 0x47 */ + "TRIE", /* 0x48 */ + "TRIEC", /* 0x49 */ + "AHOCORASICK", /* 0x4a */ + "AHOCORASICKC", /* 0x4b */ + "GOSUB", /* 0x4c */ + "GOSTART", /* 0x4d */ + "NGROUPP", /* 0x4e */ + "INSUBP", /* 0x4f */ + "DEFINEP", /* 0x50 */ + "ENDLIKE", /* 0x51 */ + "OPFAIL", /* 0x52 */ + "ACCEPT", /* 0x53 */ + "VERB", /* 0x54 */ + "PRUNE", /* 0x55 */ + "MARKPOINT", /* 0x56 */ + "SKIP", /* 0x57 */ + "COMMIT", /* 0x58 */ + "CUTGROUP", /* 0x59 */ + "KEEPS", /* 0x5a */ + "LNBREAK", /* 0x5b */ + "OPTIMIZED", /* 0x5c */ + "PSEUDO", /* 0x5d */ /* ------------ States ------------- */ "TRIE_next", /* REGNODE_MAX +0x01 */ "TRIE_next_fail", /* REGNODE_MAX +0x02 */ @@ -711,7 +726,7 @@ EXTCONST U8 PL_varies[] __attribute__deprecated__ = { EXTCONST U8 PL_varies_bitmask[]; #else EXTCONST U8 PL_varies_bitmask[] = { - 0x00, 0x00, 0x00, 0x30, 0x80, 0x3F, 0xFF, 0xCB, 0x00, 0x00, 0x00, 0x00 + 0x00, 0x00, 0x00, 0x60, 0x00, 0xFC, 0xF9, 0x5F, 0x06, 0x00, 0x00, 0x00 }; #endif /* DOINIT */ @@ -723,8 +738,8 @@ EXTCONST U8 PL_varies_bitmask[] = { EXTCONST U8 PL_simple[] __attribute__deprecated__; #else EXTCONST U8 PL_simple[] __attribute__deprecated__ = { - REG_ANY, SANY, CANY, ANYOF, POSIXD, POSIXL, POSIXU, POSIXA, NPOSIXD, - NPOSIXL, NPOSIXU, NPOSIXA, + REG_ANY, SANY, CANY, ANYOF, ANYOFL, POSIXD, POSIXL, POSIXU, POSIXA, + NPOSIXD, NPOSIXL, NPOSIXU, NPOSIXA, 0 }; #endif /* DOINIT */ @@ -733,7 +748,7 @@ EXTCONST U8 PL_simple[] __attribute__deprecated__ = { EXTCONST U8 PL_simple_bitmask[]; #else EXTCONST U8 PL_simple_bitmask[] = { - 0x00, 0x00, 0xFF, 0x0F, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 + 0x00, 0x00, 0xFF, 0x1F, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 }; #endif /* DOINIT */ diff --git a/sv.c b/sv.c index 1f9ea87..94740d3 100644 --- a/sv.c +++ b/sv.c @@ -14588,6 +14588,9 @@ perl_clone_using(PerlInterpreter *proto_perl, UV flags, /* Unicode features (see perlrun/-C) */ PL_unicode = proto_perl->Iunicode; + /* Should we warn if uses locale? */ + PL_warn_locale = proto_perl->Iwarn_locale; + /* Pre-5.8 signals control */ PL_signals = proto_perl->Isignals; diff --git a/t/lib/warnings/regexec b/t/lib/warnings/regexec index 73696df..0c6a16a 100644 --- a/t/lib/warnings/regexec +++ b/t/lib/warnings/regexec @@ -117,3 +117,29 @@ $_ = 'a' x (2**15+1); # EXPECT +######## +# NAME Wide character in non-UTF-8 locale +eval { require POSIX; POSIX->import("locale_h") }; +if ($@) { + print("SKIPPED\n# no POSIX\n"),exit; +} +use warnings 'locale'; +use locale; +setlocale(&POSIX::LC_CTYPE, "C"); +"\x{100}" =~ /\x{100}|\x{101}/il; +"\x{100}" =~ /\x{100}|\x{101}/l; +"\x{100}" =~ /\w/l; +"\x{100}" =~ /\x{100}+/l; +"\x{100}" =~ /[\x{100}\x{102}]/l; +no warnings 'locale'; +EXPECT +Wide character (U+100) in pattern match (m//) at - line 8. +Wide character (U+100) in pattern match (m//) at - line 8. +Wide character (U+100) in pattern match (m//) at - line 9. +Wide character (U+100) in pattern match (m//) at - line 9. +Wide character (U+100) in pattern match (m//) at - line 9. +Wide character (U+100) in pattern match (m//) at - line 10. +Wide character (U+100) in pattern match (m//) at - line 10. +Wide character (U+100) in pattern match (m//) at - line 11. +Wide character (U+100) in pattern match (m//) at - line 12. +Wide character (U+100) in pattern match (m//) at - line 12. diff --git a/t/lib/warnings/utf8 b/t/lib/warnings/utf8 index 3690ce1..75f3f25 100644 --- a/t/lib/warnings/utf8 +++ b/t/lib/warnings/utf8 @@ -597,9 +597,46 @@ $a = fc("\x{1E9E}"); $a = fc("\x{FB05}"); $a = uc("\x{FB00}"); $a = ucfirst("\x{149}"); +$a = lcfirst("\x{178}"); +no warnings 'locale'; +$a = lc("\x{178}"); +$a = fc("\x{1E9E}"); +$a = fc("\x{FB05}"); +$a = uc("\x{FB00}"); +$a = ucfirst("\x{149}"); +$a = lcfirst("\x{178}"); EXPECT Can't do lc("\x{178}") on non-UTF-8 locale; resolved to "\x{178}". at - line 10. Can't do fc("\x{1E9E}") on non-UTF-8 locale; resolved to "\x{17F}\x{17F}". at - line 11. Can't do fc("\x{FB05}") on non-UTF-8 locale; resolved to "\x{FB06}". at - line 12. Can't do uc("\x{FB00}") on non-UTF-8 locale; resolved to "\x{FB00}". at - line 13. Can't do ucfirst("\x{149}") on non-UTF-8 locale; resolved to "\x{149}". at - line 14. +Can't do lcfirst("\x{178}") on non-UTF-8 locale; resolved to "\x{178}". at - line 15. +######## +# NAME Wide character in non-UTF-8 locale +eval { require POSIX; POSIX->import("locale_h") }; +if ($@) { + print("SKIPPED\n# no POSIX\n"),exit; +} +use warnings 'locale'; +use feature 'fc'; +use locale; +setlocale(&POSIX::LC_CTYPE, "C"); +my $a; +$a = lc("\x{100}"); +$a = lcfirst("\x{101}"); +$a = fc("\x{102}"); +$a = uc("\x{103}"); +$a = ucfirst("\x{104}"); +no warnings 'locale'; +$a = lc("\x{100}"); +$a = lcfirst("\x{101}"); +$a = fc("\x{102}"); +$a = uc("\x{103}"); +$a = ucfirst("\x{104}"); +EXPECT +Wide character (U+100) in lc at - line 10. +Wide character (U+101) in lcfirst at - line 11. +Wide character (U+102) in fc at - line 12. +Wide character (U+103) in uc at - line 13. +Wide character (U+104) in ucfirst at - line 14. diff --git a/t/re/charset.t b/t/re/charset.t index 4d0d99c..e061916 100644 --- a/t/re/charset.t +++ b/t/re/charset.t @@ -9,6 +9,8 @@ BEGIN { use strict; use warnings; +no warnings 'locale'; # Some /l tests use above-latin1 chars to make sure + # they work, even though they warn. use Config; plan('no_plan'); diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t index c210e2e..19d6fbc 100644 --- a/t/re/pat_advanced.t +++ b/t/re/pat_advanced.t @@ -866,6 +866,7 @@ sub run_tests { ok "\x{100}X" =~ /$re/, "S_cl_and ANYOF_UNICODE & ANYOF_INVERTED"; my $loc_re = qq /(?l:^([^X]*)X)/; utf8::upgrade ($loc_re); + no warnings 'locale'; ok "\x{100}X" =~ /$loc_re/, "locale, S_cl_and ANYOF_UNICODE & ANYOF_INVERTED"; } diff --git a/utf8.c b/utf8.c index f328372..8551e11 100644 --- a/utf8.c +++ b/utf8.c @@ -1600,22 +1600,23 @@ Perl__to_uni_fold_flags(pTHX_ UV c, U8* p, STRLEN *lenp, U8 flags) PERL_ARGS_ASSERT__TO_UNI_FOLD_FLAGS; - /* Tread a UTF-8 locale as not being in locale at all */ - if (IN_UTF8_CTYPE_LOCALE) { - flags &= ~FOLD_FLAGS_LOCALE; + if (flags & FOLD_FLAGS_LOCALE) { + /* Treat a UTF-8 locale as not being in locale at all */ + if (IN_UTF8_CTYPE_LOCALE) { + flags &= ~FOLD_FLAGS_LOCALE; + } + else { + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + goto needs_full_generality; + } } if (c < 256) { - UV result = _to_fold_latin1((U8) c, p, lenp, + return _to_fold_latin1((U8) c, p, lenp, flags & (FOLD_FLAGS_FULL | FOLD_FLAGS_NOMIX_ASCII)); - /* It is illegal for the fold to cross the 255/256 boundary under - * locale; in this case return the original */ - return (result > 256 && flags & FOLD_FLAGS_LOCALE) - ? c - : result; } - /* If no special needs, just use the macro */ + /* Here, above 255. If no special needs, just use the macro */ if ( ! (flags & (FOLD_FLAGS_LOCALE|FOLD_FLAGS_NOMIX_ASCII))) { uvchr_to_utf8(p, c); return CALL_FOLD_CASE(p, p, lenp, flags & FOLD_FLAGS_FULL); @@ -1623,6 +1624,8 @@ Perl__to_uni_fold_flags(pTHX_ UV c, U8* p, STRLEN *lenp, U8 flags) else { /* Otherwise, _to_utf8_fold_flags has the intelligence to deal with the special flags. */ U8 utf8_c[UTF8_MAXBYTES + 1]; + + needs_full_generality: uvchr_to_utf8(utf8_c, c); return _to_utf8_fold_flags(utf8_c, p, lenp, flags); } @@ -1876,7 +1879,7 @@ Perl_to_utf8_case(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, } STATIC UV -S_check_locale_boundary_crossing(pTHX_ const char * const func_name, const U8* const p, const UV result, U8* const ustrp, STRLEN *lenp) +S_check_locale_boundary_crossing(pTHX_ const U8* const p, const UV result, U8* const ustrp, STRLEN *lenp) { /* This is called when changing the case of a utf8-encoded character above * the Latin1 range, and the operation is in a non-UTF-8 locale. If the @@ -1911,7 +1914,8 @@ S_check_locale_boundary_crossing(pTHX_ const char * const func_name, const U8* c s += UTF8SKIP(s); } - /* Here, no characters crossed, result is ok as-is */ + /* Here, no characters crossed, result is ok as-is, but we warn. */ + _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(p, p + UTF8SKIP(p)); return result; } @@ -1924,7 +1928,7 @@ bad_crossing: Perl_ck_warner(aTHX_ packWARN(WARN_LOCALE), "Can't do %s(\"\\x{%"UVXf"}\") on non-UTF-8 locale; " "resolved to \"\\x{%"UVXf"}\".", - func_name, + OP_DESC(PL_op), original, original); Copy(p, ustrp, *lenp, char); @@ -1949,8 +1953,14 @@ Perl__to_utf8_upper_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, bool flags PERL_ARGS_ASSERT__TO_UTF8_UPPER_FLAGS; - if (flags && IN_UTF8_CTYPE_LOCALE) { - flags = FALSE; + if (flags) { + /* Treat a UTF-8 locale as not being in locale at all */ + if (IN_UTF8_CTYPE_LOCALE) { + flags = FALSE; + } + else { + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + } } if (UTF8_IS_INVARIANT(*p)) { @@ -1975,7 +1985,7 @@ Perl__to_utf8_upper_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, bool flags result = CALL_UPPER_CASE(p, ustrp, lenp); if (flags) { - result = check_locale_boundary_crossing("uc", p, result, ustrp, lenp); + result = check_locale_boundary_crossing(p, result, ustrp, lenp); } return result; } @@ -2014,8 +2024,14 @@ Perl__to_utf8_title_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, bool flags PERL_ARGS_ASSERT__TO_UTF8_TITLE_FLAGS; - if (flags && IN_UTF8_CTYPE_LOCALE) { - flags = FALSE; + if (flags) { + /* Treat a UTF-8 locale as not being in locale at all */ + if (IN_UTF8_CTYPE_LOCALE) { + flags = FALSE; + } + else { + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + } } if (UTF8_IS_INVARIANT(*p)) { @@ -2040,7 +2056,7 @@ Perl__to_utf8_title_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, bool flags result = CALL_TITLE_CASE(p, ustrp, lenp); if (flags) { - result = check_locale_boundary_crossing("ucfirst", p, result, ustrp, lenp); + result = check_locale_boundary_crossing(p, result, ustrp, lenp); } return result; } @@ -2078,8 +2094,14 @@ Perl__to_utf8_lower_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, bool flags PERL_ARGS_ASSERT__TO_UTF8_LOWER_FLAGS; - if (flags && IN_UTF8_CTYPE_LOCALE) { - flags = FALSE; + if (flags) { + /* Treat a UTF-8 locale as not being in locale at all */ + if (IN_UTF8_CTYPE_LOCALE) { + flags = FALSE; + } + else { + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + } } if (UTF8_IS_INVARIANT(*p)) { @@ -2104,7 +2126,7 @@ Perl__to_utf8_lower_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, bool flags result = CALL_LOWER_CASE(p, ustrp, lenp); if (flags) { - result = check_locale_boundary_crossing("lc", p, result, ustrp, lenp); + result = check_locale_boundary_crossing(p, result, ustrp, lenp); } return result; @@ -2153,8 +2175,14 @@ Perl__to_utf8_fold_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, U8 flags) assert(p != ustrp); /* Otherwise overwrites */ - if (flags & FOLD_FLAGS_LOCALE && IN_UTF8_CTYPE_LOCALE) { - flags &= ~FOLD_FLAGS_LOCALE; + if (flags & FOLD_FLAGS_LOCALE) { + /* Treat a UTF-8 locale as not being in locale at all */ + if (IN_UTF8_CTYPE_LOCALE) { + flags &= ~FOLD_FLAGS_LOCALE; + } + else { + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; + } } if (UTF8_IS_INVARIANT(*p)) { @@ -2208,7 +2236,7 @@ Perl__to_utf8_fold_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, U8 flags) "resolved to \"\\x{FB06}\"."); goto return_ligature_st; } - return check_locale_boundary_crossing("fc", p, result, ustrp, lenp); + return check_locale_boundary_crossing(p, result, ustrp, lenp); } else if (! (flags & FOLD_FLAGS_NOMIX_ASCII)) { return result; @@ -3911,8 +3939,18 @@ L<http://www.unicode.org/unicode/reports/tr21/> (Case Mappings). * FOLDEQ_LOCALE is set iff the rules from the current underlying * locale are to be used. * FOLDEQ_S1_ALREADY_FOLDED s1 has already been folded before calling this - * routine. This allows that step to be skipped. + * routine. This allows that step to be skipped. + * Currently, this requires s1 to be encoded as UTF-8 + * (u1 must be true), which is asserted for. + * FOLDEQ_S1_FOLDS_SANE With either NOMIX_ASCII or LOCALE, no folds may + * cross certain boundaries. Hence, the caller should + * let this function do the folding instead of + * pre-folding. This code contains an assertion to + * that effect. However, if the caller knows what + * it's doing, it can pass this flag to indicate that, + * and the assertion is skipped. * FOLDEQ_S2_ALREADY_FOLDED Similarly. + * FOLDEQ_S2_FOLDS_SANE */ I32 Perl_foldEQ_utf8_flags(pTHX_ const char *s1, char **pe1, UV l1, bool u1, const char *s2, char **pe2, UV l2, bool u2, U32 flags) @@ -3928,11 +3966,15 @@ Perl_foldEQ_utf8_flags(pTHX_ const char *s1, char **pe1, UV l1, bool u1, const c STRLEN n1 = 0, n2 = 0; /* Number of bytes in current char */ U8 foldbuf1[UTF8_MAXBYTES_CASE+1]; U8 foldbuf2[UTF8_MAXBYTES_CASE+1]; + U8 flags_for_folder = FOLD_FLAGS_FULL; PERL_ARGS_ASSERT_FOLDEQ_UTF8_FLAGS; assert( ! ((flags & (FOLDEQ_UTF8_NOMIX_ASCII | FOLDEQ_LOCALE)) - && (flags & (FOLDEQ_S1_ALREADY_FOLDED | FOLDEQ_S2_ALREADY_FOLDED)))); + && (((flags & FOLDEQ_S1_ALREADY_FOLDED) + && !(flags & FOLDEQ_S1_FOLDS_SANE)) + || ((flags & FOLDEQ_S2_ALREADY_FOLDED) + && !(flags & FOLDEQ_S2_FOLDS_SANE))))); /* The algorithm is to trial the folds without regard to the flags on * the first line of the above assert(), and then see if the result * violates them. This means that the inputs can't be pre-folded to a @@ -3944,8 +3986,13 @@ Perl_foldEQ_utf8_flags(pTHX_ const char *s1, char **pe1, UV l1, bool u1, const c * and /iaa matches are most likely to involve code points 0-255, and this * function only under rare conditions gets called for 0-255. */ - if (IN_UTF8_CTYPE_LOCALE) { - flags &= ~FOLDEQ_LOCALE; + if (flags & FOLDEQ_LOCALE) { + if (IN_UTF8_CTYPE_LOCALE) { + flags &= ~FOLDEQ_LOCALE; + } + else { + flags_for_folder |= FOLD_FLAGS_LOCALE; + } } if (pe1) { @@ -3997,98 +4044,59 @@ Perl_foldEQ_utf8_flags(pTHX_ const char *s1, char **pe1, UV l1, bool u1, const c while (p1 < e1 && p2 < e2) { /* If at the beginning of a new character in s1, get its fold to use - * and the length of the fold. (exception: locale rules just get the - * character to a single byte) */ + * and the length of the fold. */ if (n1 == 0) { if (flags & FOLDEQ_S1_ALREADY_FOLDED) { f1 = (U8 *) p1; + assert(u1); n1 = UTF8SKIP(f1); } else { - /* If in locale matching, we use two sets of rules, depending - * on if the code point is above or below 255. Here, we test - * for and handle locale rules */ - if ((flags & FOLDEQ_LOCALE) - && (! u1 || ! UTF8_IS_ABOVE_LATIN1(*p1))) - { - /* There is no mixing of code points above and below 255. */ - if (u2 && UTF8_IS_ABOVE_LATIN1(*p2)) { - return 0; - } - - /* We handle locale rules by converting, if necessary, the - * code point to a single byte. */ - if (! u1 || UTF8_IS_INVARIANT(*p1)) { - *foldbuf1 = *p1; - } - else { **** PATCH TRUNCATED AT 2000 LINES -- 120 NOT SHOWN **** -- Perl5 Master Repository
