In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/7527883f8c7b71d808abdbd3cff07f61280a42b5?hp=427fbfe878efea40f50caa8b0da22803460f50b0>
- Log ----------------------------------------------------------------- commit 7527883f8c7b71d808abdbd3cff07f61280a42b5 Author: Karl Williamson <[email protected]> Date: Thu Dec 22 18:36:33 2016 -0700 Silence win32 compiler warning The function's parameter was not declared const in embed.fnc, but was in the function itself. M embed.fnc M proto.h commit 10085f4d25c1ff267f8ba5b37f225245a73a7c66 Author: Karl Williamson <[email protected]> Date: Thu Dec 22 18:24:26 2016 -0700 toke.c: Silence win32 compiler warning. M toke.c commit eaf412bfffcc8de096d1389f01cecba0beca4c8d Author: Karl Williamson <[email protected]> Date: Sat Dec 17 17:25:29 2016 -0700 utf8.c Extract common code into macros The 3 case changing functions: to upper, lower, and title case are essentially identical except for what they call to actually do the change; those being different macros or functions. The fourth function, to fold, is identical to the other three for the first part of its code, but diverges at the end in order to handle some special cases. This commit replaces the first part of the bodies of these 4 functions by a common macro. And it replaces the remainder of the first 3 functions by another common macro. I'm not a fan of this kind of macro to use in generating code, but it seems the best way to keep these definitions in sync. (It has to be a macro instead of a function because one of the parameters is a macro, which you can't pass to a function. I suppose one could create functions that just calls their macro, and get around it that way, but it doesn't seem worth it.) This commit just moved the code to the macro, and I manually verified that there were no logic changes. 1 of the passed-in functions requires one less argument (the final one) than the other 3. I originally tried to do something with the C preprocessor to get around that, but it didn't work with the Win32 version of the preprocessor, so I gave up and added a dummy parameter to the fourth function, which is static so that's ok to do. Below, for the record is my original attempt: /* These two macros are used to make optional a parameter to the * passed-in function to the macros just above. If the passed-in * function doesn't take the parameter, use PLACEHOLDER in the macro * call; otherwise surround the parameter by a PARAM() call */ #define PARAM(parameter) ,parameter #define PLACEHOLDER /* Something for the preprocessor to grab onto */ And within the macro, it called the function like this: L1_func(*p, ustrp, lenp/*,*/ L1_func_extra_param) M embed.fnc M proto.h M utf8.c commit 42c03a9af95fa27e50f4171d10a497b969c5f994 Author: Karl Williamson <[email protected]> Date: Sun Dec 18 13:38:01 2016 -0700 APItest/t/handy.t: Bring final special case into loop All the tests in this file are now in two loops, one for the isFOO() macros, and the other for the toFOO() macros. Thus the main logic applies to all, and tests can be added or changed easily. M ext/XS-APItest/t/handy.t commit c5546fac5a0a0ed4f70fea3b9306726d88fc73d7 Author: Karl Williamson <[email protected]> Date: Sun Dec 18 13:17:45 2016 -0700 APItest/t/handy.t: White-space only Indent newly formed block M ext/XS-APItest/t/handy.t commit a7fe852818de13803b0f9beaf831e85763759980 Author: Karl Williamson <[email protected]> Date: Sun Dec 18 12:40:06 2016 -0700 APItest/t/handy.t: Add more tests Macros with the '_uvchr' suffix were not being tested at all. Instead, the undocumented backwards-compatibility-only macros with the suffixes _uni were being tested, but these might diverge, and the tests wouldn't find that. M ext/XS-APItest/APItest.xs M ext/XS-APItest/t/handy.t commit ee9e5f10a3bc4517b1603ce48b2d385c4ce17f86 Author: Karl Williamson <[email protected]> Date: Sun Dec 18 11:55:49 2016 -0700 APItest/t/handy.t: Add more tests The macros like isALPHA() were not getting tested; instead the theory being that testing isALPHA_A() was good enough because they are #defined to be the same. But that might change and the tests wouldn't uncover that. And it turned out that some things wern't getting tested at all if there was no _A version of the macro, for example isALNUM(). This commit adds test for the version of the isFOO() macros with no suffix. M ext/XS-APItest/APItest.pm M ext/XS-APItest/APItest.xs M ext/XS-APItest/t/handy.t commit 4acecd396e371e6143fc830c34bb1b080c6d7865 Author: Karl Williamson <[email protected]> Date: Sat Dec 17 19:43:28 2016 -0700 APItest/t/handy.t: Use abbrev. char name in test names I got tired of seeing all these long character names fly by on my screen while testing, so this changes to use any official Unicode abbreviation when available. It's kind of silly to do this in this test, but I might extract and improve this for more general use in tests of characters in the future. This also changes some imports so that the full module name need not always be specified. M ext/XS-APItest/t/handy.t commit 9a00936322c17c0848302c7df2f560fbbc96c0a5 Author: Karl Williamson <[email protected]> Date: Sat Dec 17 19:22:14 2016 -0700 APItest/t/handy.t: White-space only indent newly formed block. M ext/XS-APItest/t/handy.t commit 8dc91396df45c913a676b9e1b92df92fa79ba3ce Author: Karl Williamson <[email protected]> Date: Sat Dec 17 19:19:39 2016 -0700 APItest/t/handy.t: Fold in another special case The previous commit revamped this .t to make most things part of a single loop. This adds another thing that was outside it. M ext/XS-APItest/t/handy.t commit 01a11ab935b8ca4ea1086026eeedfcd839a73826 Author: Karl Williamson <[email protected]> Date: Thu Dec 15 16:12:30 2016 -0700 APItest/t/handy.t: Refactor for maintenance Over the years code has kept getting copied and modified slightly in each new place. And a future commit would create still more. This cuts down the number of slightly different versions to the minimum reasonably attainable. M ext/XS-APItest/t/handy.t ----------------------------------------------------------------------- Summary of changes: embed.fnc | 5 +- ext/XS-APItest/APItest.pm | 2 +- ext/XS-APItest/APItest.xs | 339 ++++++++++++++++++++++++++++++ ext/XS-APItest/t/handy.t | 513 ++++++++++++++++++++++++---------------------- proto.h | 4 +- toke.c | 3 +- utf8.c | 280 +++++++++---------------- 7 files changed, 709 insertions(+), 437 deletions(-) diff --git a/embed.fnc b/embed.fnc index ca15006dac..49cf3f4941 100644 --- a/embed.fnc +++ b/embed.fnc @@ -529,7 +529,7 @@ i |void |op_relocate_sv |NN SV** svp|NN PADOFFSET* targp i |OP* |newMETHOP_internal |I32 type|I32 flags|NULLOK OP* dynamic_meth \ |NULLOK SV* const_meth : FIXME -s |OP* |fold_constants |NN OP *o +s |OP* |fold_constants |NN OP * const o #endif Afpd |char* |form |NN const char* pat|... Ap |char* |vform |NN const char* pat|NULLOK va_list* args @@ -738,7 +738,8 @@ AMp |UV |to_uni_title |UV c|NN U8 *p|NN STRLEN *lenp ADMpR |bool |isIDFIRST_lazy |NN const char* p ADMpR |bool |isALNUM_lazy |NN const char* p #ifdef PERL_IN_UTF8_C -snR |U8 |to_lower_latin1|const U8 c|NULLOK U8 *p|NULLOK STRLEN *lenp +snR |U8 |to_lower_latin1|const U8 c|NULLOK U8 *p|NULLOK STRLEN *lenp \ + |const char dummy inR |bool |is_utf8_cp_above_31_bits|NN const U8 * const s|NN const U8 * const e #endif #if defined(PERL_IN_UTF8_C) || defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) diff --git a/ext/XS-APItest/APItest.pm b/ext/XS-APItest/APItest.pm index 473d4a352e..1be011660b 100644 --- a/ext/XS-APItest/APItest.pm +++ b/ext/XS-APItest/APItest.pm @@ -5,7 +5,7 @@ use strict; use warnings; use Carp; -our $VERSION = '0.87'; +our $VERSION = '0.88'; require XSLoader; diff --git a/ext/XS-APItest/APItest.xs b/ext/XS-APItest/APItest.xs index c58e248c89..8b4e638484 100644 --- a/ext/XS-APItest/APItest.xs +++ b/ext/XS-APItest/APItest.xs @@ -4372,6 +4372,13 @@ test_isBLANK_uni(UV ord) RETVAL bool +test_isBLANK_uvchr(UV ord) + CODE: + RETVAL = isBLANK_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isBLANK_LC_uvchr(UV ord) CODE: RETVAL = isBLANK_LC_uvchr(ord); @@ -4379,6 +4386,13 @@ test_isBLANK_LC_uvchr(UV ord) RETVAL bool +test_isBLANK(UV ord) + CODE: + RETVAL = isBLANK(ord); + OUTPUT: + RETVAL + +bool test_isBLANK_A(UV ord) CODE: RETVAL = isBLANK_A(ord); @@ -4421,6 +4435,13 @@ test_isVERTWS_uni(UV ord) RETVAL bool +test_isVERTWS_uvchr(UV ord) + CODE: + RETVAL = isVERTWS_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isVERTWS_utf8(unsigned char * p) CODE: RETVAL = isVERTWS_utf8(p); @@ -4435,6 +4456,13 @@ test_isUPPER_uni(UV ord) RETVAL bool +test_isUPPER_uvchr(UV ord) + CODE: + RETVAL = isUPPER_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isUPPER_LC_uvchr(UV ord) CODE: RETVAL = isUPPER_LC_uvchr(ord); @@ -4442,6 +4470,13 @@ test_isUPPER_LC_uvchr(UV ord) RETVAL bool +test_isUPPER(UV ord) + CODE: + RETVAL = isUPPER(ord); + OUTPUT: + RETVAL + +bool test_isUPPER_A(UV ord) CODE: RETVAL = isUPPER_A(ord); @@ -4484,6 +4519,13 @@ test_isLOWER_uni(UV ord) RETVAL bool +test_isLOWER_uvchr(UV ord) + CODE: + RETVAL = isLOWER_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isLOWER_LC_uvchr(UV ord) CODE: RETVAL = isLOWER_LC_uvchr(ord); @@ -4491,6 +4533,13 @@ test_isLOWER_LC_uvchr(UV ord) RETVAL bool +test_isLOWER(UV ord) + CODE: + RETVAL = isLOWER(ord); + OUTPUT: + RETVAL + +bool test_isLOWER_A(UV ord) CODE: RETVAL = isLOWER_A(ord); @@ -4533,6 +4582,13 @@ test_isALPHA_uni(UV ord) RETVAL bool +test_isALPHA_uvchr(UV ord) + CODE: + RETVAL = isALPHA_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isALPHA_LC_uvchr(UV ord) CODE: RETVAL = isALPHA_LC_uvchr(ord); @@ -4540,6 +4596,13 @@ test_isALPHA_LC_uvchr(UV ord) RETVAL bool +test_isALPHA(UV ord) + CODE: + RETVAL = isALPHA(ord); + OUTPUT: + RETVAL + +bool test_isALPHA_A(UV ord) CODE: RETVAL = isALPHA_A(ord); @@ -4582,6 +4645,13 @@ test_isWORDCHAR_uni(UV ord) RETVAL bool +test_isWORDCHAR_uvchr(UV ord) + CODE: + RETVAL = isWORDCHAR_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isWORDCHAR_LC_uvchr(UV ord) CODE: RETVAL = isWORDCHAR_LC_uvchr(ord); @@ -4589,6 +4659,13 @@ test_isWORDCHAR_LC_uvchr(UV ord) RETVAL bool +test_isWORDCHAR(UV ord) + CODE: + RETVAL = isWORDCHAR(ord); + OUTPUT: + RETVAL + +bool test_isWORDCHAR_A(UV ord) CODE: RETVAL = isWORDCHAR_A(ord); @@ -4631,6 +4708,13 @@ test_isALPHANUMERIC_uni(UV ord) RETVAL bool +test_isALPHANUMERIC_uvchr(UV ord) + CODE: + RETVAL = isALPHANUMERIC_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isALPHANUMERIC_LC_uvchr(UV ord) CODE: RETVAL = isALPHANUMERIC_LC_uvchr(ord); @@ -4638,6 +4722,13 @@ test_isALPHANUMERIC_LC_uvchr(UV ord) RETVAL bool +test_isALPHANUMERIC(UV ord) + CODE: + RETVAL = isALPHANUMERIC(ord); + OUTPUT: + RETVAL + +bool test_isALPHANUMERIC_A(UV ord) CODE: RETVAL = isALPHANUMERIC_A(ord); @@ -4673,6 +4764,13 @@ test_isALPHANUMERIC_LC_utf8(unsigned char * p) RETVAL bool +test_isALNUM(UV ord) + CODE: + RETVAL = isALNUM(ord); + OUTPUT: + RETVAL + +bool test_isALNUM_uni(UV ord) CODE: RETVAL = isALNUM_uni(ord); @@ -4715,6 +4813,13 @@ test_isDIGIT_uni(UV ord) RETVAL bool +test_isDIGIT_uvchr(UV ord) + CODE: + RETVAL = isDIGIT_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isDIGIT_LC_uvchr(UV ord) CODE: RETVAL = isDIGIT_LC_uvchr(ord); @@ -4736,6 +4841,13 @@ test_isDIGIT_LC_utf8(unsigned char * p) RETVAL bool +test_isDIGIT(UV ord) + CODE: + RETVAL = isDIGIT(ord); + OUTPUT: + RETVAL + +bool test_isDIGIT_A(UV ord) CODE: RETVAL = isDIGIT_A(ord); @@ -4757,6 +4869,13 @@ test_isDIGIT_LC(UV ord) RETVAL bool +test_isOCTAL(UV ord) + CODE: + RETVAL = isOCTAL(ord); + OUTPUT: + RETVAL + +bool test_isOCTAL_A(UV ord) CODE: RETVAL = isOCTAL_A(ord); @@ -4778,6 +4897,13 @@ test_isIDFIRST_uni(UV ord) RETVAL bool +test_isIDFIRST_uvchr(UV ord) + CODE: + RETVAL = isIDFIRST_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isIDFIRST_LC_uvchr(UV ord) CODE: RETVAL = isIDFIRST_LC_uvchr(ord); @@ -4785,6 +4911,13 @@ test_isIDFIRST_LC_uvchr(UV ord) RETVAL bool +test_isIDFIRST(UV ord) + CODE: + RETVAL = isIDFIRST(ord); + OUTPUT: + RETVAL + +bool test_isIDFIRST_A(UV ord) CODE: RETVAL = isIDFIRST_A(ord); @@ -4827,6 +4960,13 @@ test_isIDCONT_uni(UV ord) RETVAL bool +test_isIDCONT_uvchr(UV ord) + CODE: + RETVAL = isIDCONT_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isIDCONT_LC_uvchr(UV ord) CODE: RETVAL = isIDCONT_LC_uvchr(ord); @@ -4834,6 +4974,13 @@ test_isIDCONT_LC_uvchr(UV ord) RETVAL bool +test_isIDCONT(UV ord) + CODE: + RETVAL = isIDCONT(ord); + OUTPUT: + RETVAL + +bool test_isIDCONT_A(UV ord) CODE: RETVAL = isIDCONT_A(ord); @@ -4876,6 +5023,13 @@ test_isSPACE_uni(UV ord) RETVAL bool +test_isSPACE_uvchr(UV ord) + CODE: + RETVAL = isSPACE_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isSPACE_LC_uvchr(UV ord) CODE: RETVAL = isSPACE_LC_uvchr(ord); @@ -4883,6 +5037,13 @@ test_isSPACE_LC_uvchr(UV ord) RETVAL bool +test_isSPACE(UV ord) + CODE: + RETVAL = isSPACE(ord); + OUTPUT: + RETVAL + +bool test_isSPACE_A(UV ord) CODE: RETVAL = isSPACE_A(ord); @@ -4925,6 +5086,13 @@ test_isASCII_uni(UV ord) RETVAL bool +test_isASCII_uvchr(UV ord) + CODE: + RETVAL = isASCII_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isASCII_LC_uvchr(UV ord) CODE: RETVAL = isASCII_LC_uvchr(ord); @@ -4932,6 +5100,13 @@ test_isASCII_LC_uvchr(UV ord) RETVAL bool +test_isASCII(UV ord) + CODE: + RETVAL = isASCII(ord); + OUTPUT: + RETVAL + +bool test_isASCII_A(UV ord) CODE: RETVAL = isASCII_A(ord); @@ -4974,6 +5149,13 @@ test_isCNTRL_uni(UV ord) RETVAL bool +test_isCNTRL_uvchr(UV ord) + CODE: + RETVAL = isCNTRL_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isCNTRL_LC_uvchr(UV ord) CODE: RETVAL = isCNTRL_LC_uvchr(ord); @@ -4981,6 +5163,13 @@ test_isCNTRL_LC_uvchr(UV ord) RETVAL bool +test_isCNTRL(UV ord) + CODE: + RETVAL = isCNTRL(ord); + OUTPUT: + RETVAL + +bool test_isCNTRL_A(UV ord) CODE: RETVAL = isCNTRL_A(ord); @@ -5023,6 +5212,13 @@ test_isPRINT_uni(UV ord) RETVAL bool +test_isPRINT_uvchr(UV ord) + CODE: + RETVAL = isPRINT_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isPRINT_LC_uvchr(UV ord) CODE: RETVAL = isPRINT_LC_uvchr(ord); @@ -5030,6 +5226,13 @@ test_isPRINT_LC_uvchr(UV ord) RETVAL bool +test_isPRINT(UV ord) + CODE: + RETVAL = isPRINT(ord); + OUTPUT: + RETVAL + +bool test_isPRINT_A(UV ord) CODE: RETVAL = isPRINT_A(ord); @@ -5072,6 +5275,13 @@ test_isGRAPH_uni(UV ord) RETVAL bool +test_isGRAPH_uvchr(UV ord) + CODE: + RETVAL = isGRAPH_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isGRAPH_LC_uvchr(UV ord) CODE: RETVAL = isGRAPH_LC_uvchr(ord); @@ -5079,6 +5289,13 @@ test_isGRAPH_LC_uvchr(UV ord) RETVAL bool +test_isGRAPH(UV ord) + CODE: + RETVAL = isGRAPH(ord); + OUTPUT: + RETVAL + +bool test_isGRAPH_A(UV ord) CODE: RETVAL = isGRAPH_A(ord); @@ -5121,6 +5338,13 @@ test_isPUNCT_uni(UV ord) RETVAL bool +test_isPUNCT_uvchr(UV ord) + CODE: + RETVAL = isPUNCT_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isPUNCT_LC_uvchr(UV ord) CODE: RETVAL = isPUNCT_LC_uvchr(ord); @@ -5128,6 +5352,13 @@ test_isPUNCT_LC_uvchr(UV ord) RETVAL bool +test_isPUNCT(UV ord) + CODE: + RETVAL = isPUNCT(ord); + OUTPUT: + RETVAL + +bool test_isPUNCT_A(UV ord) CODE: RETVAL = isPUNCT_A(ord); @@ -5170,6 +5401,13 @@ test_isXDIGIT_uni(UV ord) RETVAL bool +test_isXDIGIT_uvchr(UV ord) + CODE: + RETVAL = isXDIGIT_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isXDIGIT_LC_uvchr(UV ord) CODE: RETVAL = isXDIGIT_LC_uvchr(ord); @@ -5177,6 +5415,13 @@ test_isXDIGIT_LC_uvchr(UV ord) RETVAL bool +test_isXDIGIT(UV ord) + CODE: + RETVAL = isXDIGIT(ord); + OUTPUT: + RETVAL + +bool test_isXDIGIT_A(UV ord) CODE: RETVAL = isXDIGIT_A(ord); @@ -5219,6 +5464,13 @@ test_isPSXSPC_uni(UV ord) RETVAL bool +test_isPSXSPC_uvchr(UV ord) + CODE: + RETVAL = isPSXSPC_uvchr(ord); + OUTPUT: + RETVAL + +bool test_isPSXSPC_LC_uvchr(UV ord) CODE: RETVAL = isPSXSPC_LC_uvchr(ord); @@ -5226,6 +5478,13 @@ test_isPSXSPC_LC_uvchr(UV ord) RETVAL bool +test_isPSXSPC(UV ord) + CODE: + RETVAL = isPSXSPC(ord); + OUTPUT: + RETVAL + +bool test_isPSXSPC_A(UV ord) CODE: RETVAL = isPSXSPC_A(ord); @@ -5618,6 +5877,26 @@ test_toLOWER_uni(UV ord) RETVAL AV * +test_toLOWER_uvchr(UV ord) + PREINIT: + U8 s[UTF8_MAXBYTES_CASE + 1]; + STRLEN len; + AV *av; + SV *utf8; + CODE: + av = newAV(); + av_push(av, newSVuv(toLOWER_uvchr(ord, s, &len))); + + utf8 = newSVpvn((char *) s, len); + SvUTF8_on(utf8); + av_push(av, utf8); + + av_push(av, newSVuv(len)); + RETVAL = av; + OUTPUT: + RETVAL + +AV * test_toLOWER_utf8(SV * p) PREINIT: U8 *input; @@ -5674,6 +5953,26 @@ test_toFOLD_uni(UV ord) RETVAL AV * +test_toFOLD_uvchr(UV ord) + PREINIT: + U8 s[UTF8_MAXBYTES_CASE + 1]; + STRLEN len; + AV *av; + SV *utf8; + CODE: + av = newAV(); + av_push(av, newSVuv(toFOLD_uvchr(ord, s, &len))); + + utf8 = newSVpvn((char *) s, len); + SvUTF8_on(utf8); + av_push(av, utf8); + + av_push(av, newSVuv(len)); + RETVAL = av; + OUTPUT: + RETVAL + +AV * test_toFOLD_utf8(SV * p) PREINIT: U8 *input; @@ -5730,6 +6029,26 @@ test_toUPPER_uni(UV ord) RETVAL AV * +test_toUPPER_uvchr(UV ord) + PREINIT: + U8 s[UTF8_MAXBYTES_CASE + 1]; + STRLEN len; + AV *av; + SV *utf8; + CODE: + av = newAV(); + av_push(av, newSVuv(toUPPER_uvchr(ord, s, &len))); + + utf8 = newSVpvn((char *) s, len); + SvUTF8_on(utf8); + av_push(av, utf8); + + av_push(av, newSVuv(len)); + RETVAL = av; + OUTPUT: + RETVAL + +AV * test_toUPPER_utf8(SV * p) PREINIT: U8 *input; @@ -5779,6 +6098,26 @@ test_toTITLE_uni(UV ord) RETVAL AV * +test_toTITLE_uvchr(UV ord) + PREINIT: + U8 s[UTF8_MAXBYTES_CASE + 1]; + STRLEN len; + AV *av; + SV *utf8; + CODE: + av = newAV(); + av_push(av, newSVuv(toTITLE_uvchr(ord, s, &len))); + + utf8 = newSVpvn((char *) s, len); + SvUTF8_on(utf8); + av_push(av, utf8); + + av_push(av, newSVuv(len)); + RETVAL = av; + OUTPUT: + RETVAL + +AV * test_toTITLE_utf8(SV * p) PREINIT: U8 *input; diff --git a/ext/XS-APItest/t/handy.t b/ext/XS-APItest/t/handy.t index a85f701e99..b08e8146d3 100644 --- a/ext/XS-APItest/t/handy.t +++ b/ext/XS-APItest/t/handy.t @@ -11,33 +11,99 @@ use Config; use XS::APItest; -use Unicode::UCD qw(prop_invlist prop_invmap); +my $tab = " " x 4; # Indent subsidiary tests this much + +use Unicode::UCD qw(search_invlist prop_invmap prop_invlist); +my ($charname_list, $charname_map, $format, $default) = prop_invmap("Name Alias"); + +sub get_charname($) { + my $cp = shift; + + # If there is a an abbreviation for the code point name, use it + my $name_index = search_invlist(\@{$charname_list}, $cp); + if (defined $name_index) { + my $synonyms = $charname_map->[$name_index]; + if (ref $synonyms) { + my $pat = qr/: abbreviation/; + my @abbreviations = grep { $_ =~ $pat } @$synonyms; + if (@abbreviations) { + return $abbreviations[0] =~ s/$pat//r; + } + } + } + + # Otherwise, use the full name + use charnames (); + return charnames::viacode($cp) // "No name"; +} sub truth($) { # Converts values so is() works return (shift) ? 1 : 0; } -my $locale; +my $base_locale; my $utf8_locale; if(locales_enabled('LC_ALL')) { require POSIX; - $locale = POSIX::setlocale( &POSIX::LC_ALL, "C"); - if (defined $locale && $locale eq 'C') { + $base_locale = POSIX::setlocale( &POSIX::LC_ALL, "C"); + if (defined $base_locale && $base_locale eq 'C') { use locale; # make \w work right in non-ASCII lands # Some locale implementations don't have the 128-255 characters all # mean nothing. Skip the locale tests in that situation for my $i (128 .. 255) { if (chr(utf8::unicode_to_native($i)) =~ /[[:print:]]/) { - undef $locale; + undef $base_locale; last; } } - $utf8_locale = find_utf8_ctype_locale(); + $utf8_locale = find_utf8_ctype_locale() if $base_locale; } } +sub get_display_locale_or_skip($$) { + + # Helper function intimately tied to its callers. It knows the loop + # iterates with a locale of "", meaning don't use locale; $base_locale + # meaning to use a non-UTF-8 locale; and $utf8_locale. + # + # It checks to see if the current test should be skipped or executed, + # returning an empty list for the former, and for the latter: + # ( 'locale display name', + # bool of is this a UTF-8 locale ) + # + # The display name is the empty string if not using locale. Functions + # with _LC in their name are skipped unless in locale, and functions + # without _LC are executed only outside locale. However, if no locales at + # all are on the system, the _LC functions are executed outside locale. + + my ($locale, $suffix) = @_; + + # The test should be skipped if the input is for a non-existent locale + return unless defined $locale; + + # Here the input is defined, either a locale name or "". If the test is + # for not using locales, we want to do the test for non-LC functions, + # and skip it for LC ones (except if there are no locales on the system, + # we do it for LC ones as if they weren't LC). + if ($locale eq "") { + return ("", 0) if $suffix !~ /LC/ || ! defined $base_locale; + return; + } + + # Here the input is for a real locale. We don't test the non-LC functions + # for locales. + return if $suffix !~ /LC/; + + # Here is for a LC function and a real locale. The base locale is not + # UTF-8. + return (" ($locale locale)", 0) if $locale eq $base_locale; + + # The only other possibility is that we have a UTF-8 locale + return (" ($locale)", 1); +} + my %properties = ( # name => Lookup-property name alnum => 'Word', @@ -65,17 +131,24 @@ my %properties = ( my @warnings; local $SIG{__WARN__} = sub { push @warnings, @_ }; -use charnames (); -foreach my $name (sort keys %properties) { - my $property = $properties{$name}; - my @invlist = prop_invlist($property, '_perl_core_internal_ok'); - if (! @invlist) { - - # An empty return could mean an unknown property, or merely that it is - # empty. Call in scalar context to differentiate - if (! prop_invlist($property, '_perl_core_internal_ok')) { - fail("No inversion list found for $property"); - next; + +foreach my $name (sort keys %properties, 'octal') { + my @invlist; + if ($name eq 'octal') { + # Hand-roll an inversion list with 0-7 in it and nothing else. + push @invlist, ord "0", ord "8"; + } + else { + my $property = $properties{$name}; + @invlist = prop_invlist($property, '_perl_core_internal_ok'); + if (! @invlist) { + + # An empty return could mean an unknown property, or merely that + # it is empty. Call in scalar context to differentiate + if (! prop_invlist($property, '_perl_core_internal_ok')) { + fail("No inversion list found for $property"); + next; + } } } @@ -107,7 +180,11 @@ foreach my $name (sort keys %properties) { my $i = utf8::native_to_unicode($j); my $function = uc($name); - my $matches = Unicode::UCD::search_invlist(\@invlist, $i); + is (@warnings, 0, "Got no unexpected warnings in previous iteration") + or diag("@warnings"); + undef @warnings; + + my $matches = search_invlist(\@invlist, $i); if (! defined $matches) { $matches = 0; } @@ -116,165 +193,110 @@ foreach my $name (sort keys %properties) { } my $ret; - my $char_name = charnames::viacode($i) // "No name"; + my $char_name = get_charname($j); my $display_name = sprintf "\\x{%02X, %s}", $i, $char_name; + my $display_call = "is${function}( $display_name )"; - if ($name eq 'quotemeta') { # There is only one macro for this, and is - # defined only for Latin1 range - $ret = truth eval "test_is${function}($i)"; - if ($@) { - fail $@; - } - else { - my $truth = truth($matches && $i < 256); - is ($ret, $truth, "is${function}( $display_name ) == $truth"); - } - next; - } + foreach my $suffix ("", "_A", "_L1", "_LC", "_uni", "_uvchr", + "_LC_uvchr", "_utf8", "_LC_utf8") + { - # vertws is always all of Unicode; ALNUM_A and ALNUM_L1 are not - # defined as they were added later, after WORDCHAR was created to be a - # clearer synonym for ALNUM - if ($name ne 'vertws') { - if ($name ne 'alnum') { - $ret = truth eval "test_is${function}_A($i)"; - if ($@) { - fail($@); - } - else { - my $truth = truth($matches && utf8::native_to_unicode($i) < 128); - is ($ret, $truth, "is${function}_A( $display_name ) == $truth"); - } - $ret = truth eval "test_is${function}_L1($i)"; - if ($@) { - fail($@); - } - else { - my $truth = truth($matches && $i < 256); - is ($ret, $truth, "is${function}_L1( $display_name ) == $truth"); - } - } + # Not all possible macros have been defined + if ($name eq 'vertws') { - if (defined $locale) { - use locale; - POSIX::setlocale( &POSIX::LC_ALL, "C"); - $ret = truth eval "test_is${function}_LC($i)"; - if ($@) { - fail($@); - } - else { - my $truth = truth($matches && utf8::native_to_unicode($i) < 128); - is ($ret, $truth, "is${function}_LC( $display_name ) == $truth (C locale)"); - } + # vertws is always all of Unicode + next if $suffix !~ / ^ _ ( uni | uvchr | utf8 ) $ /x; } - - if (defined $utf8_locale) { - use locale; - - POSIX::setlocale( &POSIX::LC_ALL, $utf8_locale); - $ret = truth eval "test_is${function}_LC($i)"; - if ($@) { - fail($@); - } - else { - - # UTF-8 locale works on full range 0-255 - my $truth = truth($matches && $i < 256); - is ($ret, $truth, "is${function}_LC( $display_name ) == $truth ($utf8_locale)"); - } + elsif ($name eq 'alnum') { + + # ALNUM_A, ALNUM_L1, and ALNUM_uvchr are not defined as these + # suffixes were added later, after WORDCHAR was created to be + # a clearer synonym for ALNUM + next if $suffix eq '_A' + || $suffix eq '_L1' + || $suffix eq '_uvchr'; } - } - - $ret = truth eval "test_is${function}_uni($i)"; - if ($@) { - fail($@); - } - else { - is ($ret, $matches, "is${function}_uni( $display_name ) == $matches"); - } - - if (defined $locale && $name ne 'vertws') { - use locale; - POSIX::setlocale( &POSIX::LC_ALL, "C"); - $ret = truth eval "test_is${function}_LC_uvchr('$i')"; - if ($@) { - fail($@); - } - else { - my $truth = truth($matches && (utf8::native_to_unicode($i) < 128 || $i > 255)); - is ($ret, $truth, "is${function}_LC_uvchr( $display_name ) == $truth (C locale)"); - } - } - - if (defined $utf8_locale && $name ne 'vertws') { - use locale; - - POSIX::setlocale( &POSIX::LC_ALL, $utf8_locale); - $ret = truth eval "test_is${function}_LC_uvchr('$i')"; - if ($@) { - fail($@); + elsif ($name eq 'octal') { + next if $suffix ne "" && $suffix ne '_A' && $suffix ne '_L1'; } - else { - my $truth = truth($matches); - is ($ret, $truth, "is${function}_LC_uvchr( $display_name ) == $truth ($utf8_locale)"); + elsif ($name eq 'quotemeta') { + # There is only one macro for this, and is defined only for + # Latin1 range + next if $suffix ne "" } - } - my $char = chr($i); - utf8::upgrade($char); - $char = quotemeta $char if $char eq '\\' || $char eq "'"; - $ret = truth eval "test_is${function}_utf8('$char')"; - if ($@) { - fail($@); - } - else { - is ($ret, $matches, "is${function}_utf8( $display_name ) == $matches"); - } - - if ($name ne 'vertws' && defined $locale) { - use locale; - POSIX::setlocale( &POSIX::LC_ALL, "C"); - $ret = truth eval "test_is${function}_LC_utf8('$char')"; - if ($@) { - fail($@); - } - else { - my $truth = truth($matches && (utf8::native_to_unicode($i) < 128 || $i > 255)); - is ($ret, $truth, "is${function}_LC_utf8( $display_name ) == $truth (C locale)"); - } - } - - if ($name ne 'vertws' && defined $utf8_locale) { - use locale; + foreach my $locale ("", $base_locale, $utf8_locale) { + + my ($display_locale, $locale_is_utf8) + = get_display_locale_or_skip($locale, $suffix); + next unless defined $display_locale; + + use if $locale, "locale"; + POSIX::setlocale( &POSIX::LC_ALL, $locale) if $locale; + + if ($suffix !~ /utf8/) { # _utf8 has to handled specially + my $display_call + = "is${function}$suffix( $display_name )$display_locale"; + $ret = truth eval "test_is${function}$suffix($i)"; + if (is ($@, "", "$display_call didn't give error")) { + my $truth = $matches; + if ($truth) { + + # The single byte functions are false for + # above-Latin1 + if ($i >= 256) { + $truth = 0 + if $suffix=~ / ^ ( _A | _L [1C] )? $ /x; + } + elsif ( utf8::native_to_unicode($i) >= 128 + && $name ne 'quotemeta') + { + + # The no-suffix and _A functions are false + # for non-ASCII. So are _LC functions on a + # non-UTF-8 locale + $truth = 0 if $suffix eq "_A" + || $suffix eq "" + || ( $suffix =~ /LC/ + && ! $locale_is_utf8); + } + } + + is ($ret, $truth, "${tab}And correctly returns $truth"); + } + } + else { # _utf8 suffix + my $char = chr($i); + utf8::upgrade($char); + $char = quotemeta $char if $char eq '\\' || $char eq "'"; + my $truth; + if ( $suffix =~ /LC/ + && ! $locale_is_utf8 + && $i < 256 + && utf8::native_to_unicode($i) >= 128) + { # The C-locale _LC function returns FALSE for Latin1 + # above ASCII + $truth = 0; + } + else { + $truth = $matches; + } - POSIX::setlocale( &POSIX::LC_ALL, $utf8_locale); - $ret = truth eval "test_is${function}_LC_utf8('$char')"; - if ($@) { - fail($@); - } - else { - my $truth = truth($matches); - is ($ret, $truth, "is${function}_LC_utf8( $display_name ) == $truth ($utf8_locale)"); + my $display_call = "is${function}$suffix(" + . " $display_name )$display_locale"; + $ret = truth eval "test_is${function}$suffix('$char')"; + if (is ($@, "", "$display_call didn't give error")) { + is ($ret, $truth, + "${tab}And correctly returned $truth"); + } + } } } } } -# Test isOCTAL() -for my $i (0 .. 256, 0x110000) { - my $char_name = charnames::viacode($i) // "No name"; - my $display_name = sprintf "\\N{U+%02X, %s}", $i, $char_name; - my $truth = truth($i >= ord('0') && $i <= ord('7')); - - my $ret = truth test_isOCTAL_A($i); - is($ret, $truth, "isOCTAL_A( $display_name ) == $truth"); - - $ret = truth test_isOCTAL_L1($i); - is($ret, $truth, "isOCTAL_L1( $display_name ) == $truth"); -} - my %to_properties = ( - FOLD => 'Case_Folding', + FOLD => 'Case_Folding', LOWER => 'Lowercase_Mapping', TITLE => 'Titlecase_Mapping', UPPER => 'Uppercase_Mapping', @@ -305,12 +327,14 @@ foreach my $name (sort keys %to_properties) { my $range_start = $list_ref->[$i]; next if $range_start < 257; if (ref $map_ref->[$i] && $multi_char < 5) { - push @code_points, $range_start - 1 if $code_points[-1] != $range_start - 1; + push @code_points, $range_start - 1 + if $code_points[-1] != $range_start - 1; push @code_points, $range_start; $multi_char++; } elsif ($above_latins < 5) { - push @code_points, $range_start - 1 if $code_points[-1] != $range_start - 1; + push @code_points, $range_start - 1 + if $code_points[-1] != $range_start - 1; push @code_points, $range_start; $above_latins++; } @@ -326,83 +350,67 @@ foreach my $name (sort keys %to_properties) { my $i = utf8::native_to_unicode($j); my $function = $name; - my $index = Unicode::UCD::search_invlist(\@{$list_ref}, $j); + my $index = search_invlist(\@{$list_ref}, $j); my $ret; - my $char_name = charnames::viacode($j) // "No name"; + my $char_name = get_charname($j); my $display_name = sprintf "\\N{U+%02X, %s}", $i, $char_name; - # Test the base function - $ret = eval "test_to${function}($j)"; - if ($@) { - fail($@); - } - else { - my $should_be = ($i < 128 && $map_ref->[$index] != $missing) - ? $map_ref->[$index] + $j - $list_ref->[$index] - : $j; - is ($ret, $should_be, sprintf("to${function}( $display_name ) == 0x%02X", $should_be)); - } + foreach my $suffix ("", "_L1", "_LC") { - # Test _L1 - if ($name eq 'LOWER') { - $ret = eval "test_to${function}_L1($j)"; - if ($@) { - fail($@); - } - else { - my $should_be = ($i < 256 && $map_ref->[$index] != $missing) - ? $map_ref->[$index] + $j - $list_ref->[$index] - : $j; - is ($ret, $should_be, sprintf("to${function}_L1( $display_name ) == 0x%02X", $should_be)); - } - } + # This is the only macro defined for L1 + next if $suffix eq "_L1" && $function ne "LOWER"; - if ($name ne 'TITLE') { # Test _LC; titlecase is not defined in locales. - if (defined $locale) { - use locale; - POSIX::setlocale( &POSIX::LC_ALL, "C"); - $ret = eval "test_to${function}_LC($j)"; - if ($@) { - fail($@); - } - else { - my $should_be = ($i < 128 && $map_ref->[$index] != $missing) - ? $map_ref->[$index] + $j - $list_ref->[$index] - : $j; - is ($ret, $should_be, sprintf("to${function}_LC( $display_name ) == 0x%02X (C locale)", $should_be)); - } - } + SKIP: + foreach my $locale ("", $base_locale, $utf8_locale) { + + # titlecase is not defined in locales. + next if $name eq 'TITLE' && $suffix eq "_LC"; - if (defined $utf8_locale) { - use locale; + my ($display_locale, $locale_is_utf8) + = get_display_locale_or_skip($locale, $suffix); + next unless defined $display_locale; - SKIP: { - skip "to${property}_LC does not work for LATIN SMALL LETTER SHARP S", 1 - if $j == 0xDF && ($name eq 'FOLD' || $name eq 'UPPER'); + skip("to${name}_LC does not work for LATIN SMALL LETTER SHARP S" + . "$display_locale", 1) + if $i == 0xDF && $name =~ / FOLD | UPPER /x + && $suffix eq "_LC" && $locale_is_utf8; - POSIX::setlocale( &POSIX::LC_ALL, $utf8_locale); - $ret = eval "test_to${function}_LC($j)"; - if ($@) { - fail($@); + use if $locale, "locale"; + POSIX::setlocale( &POSIX::LC_ALL, $locale) if $locale; + + my $display_call = "to${function}$suffix(" + . " $display_name )$display_locale"; + $ret = eval "test_to${function}$suffix($j)"; + if (is ($@, "", "$display_call didn't give error")) { + my $should_be; + if ($i > 255) { + $should_be = $j; + } + elsif ( $i > 127 + && ( $suffix eq "" + || ($suffix eq "_LC" && ! $locale_is_utf8))) + { + $should_be = $j; + } + elsif ($map_ref->[$index] != $missing) { + $should_be = $map_ref->[$index] + $j - $list_ref->[$index] } else { - my $should_be = ($i < 256 - && ! ref $map_ref->[$index] - && $map_ref->[$index] != $missing - ) - ? $map_ref->[$index] + $j - $list_ref->[$index] - : $j; - is ($ret, $should_be, sprintf("to${function}_LC( $display_name ) == 0x%02X ($utf8_locale)", $should_be)); + $should_be = $j; } + + is ($ret, $should_be, + sprintf("${tab}And correctly returned 0x%02X", + $should_be)); } } } - # The _uni and _utf8 functions return both the ordinal of the first - # code point of the result, and the result in utf8. The .xs tests - # return these in an array, in [0] and [1] respectively, with [2] the - # length of the utf8 in bytes. + # The _uni, uvchr, and _utf8 functions return both the ordinal of the + # first code point of the result, and the result in utf8. The .xs + # tests return these in an array, in [0] and [1] respectively, with + # [2] the length of the utf8 in bytes. my $utf8_should_be = ""; my $first_ord_should_be; if (ref $map_ref->[$index]) { # A multi-char result @@ -414,47 +422,52 @@ foreach my $name (sort keys %to_properties) { } else { # A single-char result $first_ord_should_be = ($map_ref->[$index] != $missing) - ? $map_ref->[$index] + $j - $list_ref->[$index] + ? $map_ref->[$index] + $j + - $list_ref->[$index] : $j; $utf8_should_be = chr $first_ord_should_be; } utf8::upgrade($utf8_should_be); - # Test _uni - my $s; - my $len; - $ret = eval "test_to${function}_uni($j)"; - if ($@) { - fail($@); - } - else { - is ($ret->[0], $first_ord_should_be, sprintf("to${function}_uni( $display_name ) == 0x%02X", $first_ord_should_be)); - is ($ret->[1], $utf8_should_be, sprintf("utf8 of to${function}_uni( $display_name )")); - use bytes; - is ($ret->[2], length $utf8_should_be, sprintf("number of bytes in utf8 of to${function}_uni( $display_name )")); + # Test _uni, uvchr + foreach my $suffix ('_uni', '_uvchr') { + my $s; + my $len; + my $display_call = "to${function}$suffix( $display_name )"; + $ret = eval "test_to${function}$suffix($j)"; + if (is ($@, "", "$display_call didn't give error")) { + is ($ret->[0], $first_ord_should_be, + sprintf("${tab}And correctly returned 0x%02X", + $first_ord_should_be)); + is ($ret->[1], $utf8_should_be, "${tab}Got correct utf8"); + use bytes; + is ($ret->[2], length $utf8_should_be, + "${tab}Got correct number of bytes for utf8 length"); + } } # Test _utf8 my $char = chr($j); utf8::upgrade($char); $char = quotemeta $char if $char eq '\\' || $char eq "'"; - $ret = eval "test_to${function}_utf8('$char')"; - if ($@) { - fail($@); - } - else { - is ($ret->[0], $first_ord_should_be, sprintf("to${function}_utf8( $display_name ) == 0x%02X", $first_ord_should_be)); - is ($ret->[1], $utf8_should_be, sprintf("utf8 of to${function}_utf8( $display_name )")); - use bytes; - is ($ret->[2], length $utf8_should_be, sprintf("number of bytes in utf8 of to${function}_uni( $display_name )")); + { + my $display_call = "to${function}_utf8($display_name )"; + $ret = eval "test_to${function}_utf8('$char')"; + if (is ($@, "", "$display_call didn't give error")) { + is ($ret->[0], $first_ord_should_be, + sprintf("${tab}And correctly returned 0x%02X", + $first_ord_should_be)); + is ($ret->[1], $utf8_should_be, "${tab}Got correct utf8"); + use bytes; + is ($ret->[2], length $utf8_should_be, + "${tab}Got correct number of bytes for utf8 length"); + } } - } } # This is primarily to make sure that no non-Unicode warnings get generated -unless (is(scalar @warnings, 0, "No warnings were generated")) { - diag @warnings; -} +is(scalar @warnings, 0, "No unexpected warnings were generated in the tests") + or diag @warnings; done_testing; diff --git a/proto.h b/proto.h index 7fcb579ee1..ecf6f71c11 100644 --- a/proto.h +++ b/proto.h @@ -4541,7 +4541,7 @@ STATIC void S_finalize_op(pTHX_ OP* o); STATIC void S_find_and_forget_pmops(pTHX_ OP *o); #define PERL_ARGS_ASSERT_FIND_AND_FORGET_PMOPS \ assert(o) -STATIC OP* S_fold_constants(pTHX_ OP *o); +STATIC OP* S_fold_constants(pTHX_ OP * const o); #define PERL_ARGS_ASSERT_FOLD_CONSTANTS \ assert(o) STATIC OP* S_force_list(pTHX_ OP* arg, bool nullit); @@ -5641,7 +5641,7 @@ STATIC SV* S_swatch_get(pTHX_ SV* swash, UV start, UV span) #define PERL_ARGS_ASSERT_SWATCH_GET \ assert(swash) -STATIC U8 S_to_lower_latin1(const U8 c, U8 *p, STRLEN *lenp) +STATIC U8 S_to_lower_latin1(const U8 c, U8 *p, STRLEN *lenp, const char dummy) __attribute__warn_unused_result__; STATIC char * S_unexpected_non_continuation_text(pTHX_ const U8 * const s, STRLEN print_len, const STRLEN non_cont_byte_pos, const STRLEN expect_len) diff --git a/toke.c b/toke.c index f695265665..f0a7dbc3a0 100644 --- a/toke.c +++ b/toke.c @@ -9467,7 +9467,8 @@ S_scan_subst(pTHX_ char *start) * spreads over */ sv_upgrade(PL_parser->lex_sub_repl, SVt_PVNV); ((XPVNV*)SvANY(PL_parser->lex_sub_repl))->xnv_u.xnv_lines = 0; - ((XPVIV*)SvANY(PL_parser->lex_sub_repl))->xiv_u.xivu_eval_seen = es; + ((XPVIV*)SvANY(PL_parser->lex_sub_repl))->xiv_u.xivu_eval_seen = + cBOOL(es); } PL_lex_op = (OP*)pm; diff --git a/utf8.c b/utf8.c index bc7cc0638c..9fe9b03ed8 100644 --- a/utf8.c +++ b/utf8.c @@ -2207,7 +2207,7 @@ Perl_to_uni_title(pTHX_ UV c, U8* p, STRLEN *lenp) } STATIC U8 -S_to_lower_latin1(const U8 c, U8* p, STRLEN *lenp) +S_to_lower_latin1(const U8 c, U8* p, STRLEN *lenp, const char dummy) { /* We have the latin1-range values compiled into the core, so just use * those, converting the result to UTF-8. Since the result is always just @@ -2215,6 +2215,8 @@ S_to_lower_latin1(const U8 c, U8* p, STRLEN *lenp) U8 converted = toLOWER_LATIN1(c); + PERL_UNUSED_ARG(dummy); + if (p != NULL) { if (NATIVE_BYTE_IS_INVARIANT(converted)) { *p = converted; @@ -2237,7 +2239,7 @@ Perl_to_uni_lower(pTHX_ UV c, U8* p, STRLEN *lenp) PERL_ARGS_ASSERT_TO_UNI_LOWER; if (c < 256) { - return to_lower_latin1((U8) c, p, lenp); + return to_lower_latin1((U8) c, p, lenp, 0 /* 0 is a dummy arg */ ); } uvchr_to_utf8(p, c); @@ -2746,6 +2748,89 @@ S_check_locale_boundary_crossing(pTHX_ const U8* const p, const UV result, U8* c return original; } +/* The process for changing the case is essentially the same for the four case + * change types, except there are complications for folding. Otherwise the + * difference is only which case to change to. To make sure that they all do + * the same thing, the bodies of the functions are extracted out into the + * following two macros. The functions are written with the same variable + * names, and these are known and used inside these macros. It would be + * better, of course, to have inline functions to do it, but since different + * macros are called, depending on which case is being changed to, this is not + * feasible in C (to khw's knowledge). Two macros are created so that the fold + * function can start with the common start macro, then finish with its special + * handling; while the other three cases can just use the common end macro. + * + * The algorithm is to use the proper (passed in) macro or function to change + * the case for code points that are below 256. The macro is used if using + * locale rules for the case change; the function if not. If the code point is + * above 255, it is computed from the input UTF-8, and another macro is called + * to do the conversion. If necessary, the output is converted to UTF-8. If + * using a locale, we have to check that the change did not cross the 255/256 + * boundary, see check_locale_boundary_crossing() for further details. + * + * The macros are split with the correct case change for the below-256 case + * stored into 'result', and in the middle of an else clause for the above-255 + * case. At that point in the 'else', 'result' is not the final result, but is + * the input code point calculated from the UTF-8. The fold code needs to + * realize all this and take it from there. + * + * If you read the two macros as sequential, it's easier to understand what's + * going on. */ +#define CASE_CHANGE_BODY_START(locale_flags, LC_L1_change_macro, L1_func, \ + L1_func_extra_param) \ + if (flags & (locale_flags)) { \ + /* Treat a UTF-8 locale as not being in locale at all */ \ + if (IN_UTF8_CTYPE_LOCALE) { \ + flags &= ~(locale_flags); \ + } \ + else { \ + _CHECK_AND_WARN_PROBLEMATIC_LOCALE; \ + } \ + } \ + \ + if (UTF8_IS_INVARIANT(*p)) { \ + if (flags & (locale_flags)) { \ + result = LC_L1_change_macro(*p); \ + } \ + else { \ + return L1_func(*p, ustrp, lenp, L1_func_extra_param); \ + } \ + } \ + else if UTF8_IS_DOWNGRADEABLE_START(*p) { \ + if (flags & (locale_flags)) { \ + result = LC_L1_change_macro(EIGHT_BIT_UTF8_TO_NATIVE(*p, \ + *(p+1))); \ + } \ + else { \ + return L1_func(EIGHT_BIT_UTF8_TO_NATIVE(*p, *(p+1)), \ + ustrp, lenp, L1_func_extra_param); \ + } \ + } \ + else { /* malformed UTF-8 */ \ + result = valid_utf8_to_uvchr(p, NULL); \ + +#define CASE_CHANGE_BODY_END(locale_flags, change_macro) \ + result = change_macro(result, p, ustrp, lenp); \ + \ + if (flags & (locale_flags)) { \ + result = check_locale_boundary_crossing(p, result, ustrp, lenp); \ + } \ + return result; \ + } \ + \ + /* Here, used locale rules. Convert back to UTF-8 */ \ + if (UTF8_IS_INVARIANT(result)) { \ + *ustrp = (U8) result; \ + *lenp = 1; \ + } \ + else { \ + *ustrp = UTF8_EIGHT_BIT_HI((U8) result); \ + *(ustrp + 1) = UTF8_EIGHT_BIT_LO((U8) result); \ + *lenp = 2; \ + } \ + \ + return result; + /* =for apidoc to_utf8_upper @@ -2764,55 +2849,10 @@ Perl__to_utf8_upper_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, bool flags PERL_ARGS_ASSERT__TO_UTF8_UPPER_FLAGS; - if (flags) { - /* Treat a UTF-8 locale as not being in locale at all */ - if (IN_UTF8_CTYPE_LOCALE) { - flags = FALSE; - } - else { - _CHECK_AND_WARN_PROBLEMATIC_LOCALE; - } - } - - if (UTF8_IS_INVARIANT(*p)) { - if (flags) { - result = toUPPER_LC(*p); - } - else { - return _to_upper_title_latin1(*p, ustrp, lenp, 'S'); - } - } - else if UTF8_IS_DOWNGRADEABLE_START(*p) { - if (flags) { - U8 c = EIGHT_BIT_UTF8_TO_NATIVE(*p, *(p+1)); - result = toUPPER_LC(c); - } - else { - return _to_upper_title_latin1(EIGHT_BIT_UTF8_TO_NATIVE(*p, *(p+1)), - ustrp, lenp, 'S'); - } - } - else { /* UTF-8, ord above 255 */ - result = CALL_UPPER_CASE(valid_utf8_to_uvchr(p, NULL), p, ustrp, lenp); - - if (flags) { - result = check_locale_boundary_crossing(p, result, ustrp, lenp); - } - return result; - } - - /* Here, used locale rules. Convert back to UTF-8 */ - if (UTF8_IS_INVARIANT(result)) { - *ustrp = (U8) result; - *lenp = 1; - } - else { - *ustrp = UTF8_EIGHT_BIT_HI((U8) result); - *(ustrp + 1) = UTF8_EIGHT_BIT_LO((U8) result); - *lenp = 2; - } - - return result; + /* ~0 makes anything non-zero in 'flags' mean we are using locale rules */ + /* 2nd char of uc(U+DF) is 'S' */ + CASE_CHANGE_BODY_START(~0, toUPPER_LC, _to_upper_title_latin1, 'S'); + CASE_CHANGE_BODY_END (~0, CALL_UPPER_CASE); } /* @@ -2835,55 +2875,9 @@ Perl__to_utf8_title_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, bool flags PERL_ARGS_ASSERT__TO_UTF8_TITLE_FLAGS; - if (flags) { - /* Treat a UTF-8 locale as not being in locale at all */ - if (IN_UTF8_CTYPE_LOCALE) { - flags = FALSE; - } - else { - _CHECK_AND_WARN_PROBLEMATIC_LOCALE; - } - } - - if (UTF8_IS_INVARIANT(*p)) { - if (flags) { - result = toUPPER_LC(*p); - } - else { - return _to_upper_title_latin1(*p, ustrp, lenp, 's'); - } - } - else if UTF8_IS_DOWNGRADEABLE_START(*p) { - if (flags) { - U8 c = EIGHT_BIT_UTF8_TO_NATIVE(*p, *(p+1)); - result = toUPPER_LC(c); - } - else { - return _to_upper_title_latin1(EIGHT_BIT_UTF8_TO_NATIVE(*p, *(p+1)), - ustrp, lenp, 's'); - } - } - else { /* UTF-8, ord above 255 */ - result = CALL_TITLE_CASE(valid_utf8_to_uvchr(p, NULL), p, ustrp, lenp); - - if (flags) { - result = check_locale_boundary_crossing(p, result, ustrp, lenp); - } - return result; - } - - /* Here, used locale rules. Convert back to UTF-8 */ - if (UTF8_IS_INVARIANT(result)) { - *ustrp = (U8) result; - *lenp = 1; - } - else { - *ustrp = UTF8_EIGHT_BIT_HI((U8) result); - *(ustrp + 1) = UTF8_EIGHT_BIT_LO((U8) result); - *lenp = 2; - } - - return result; + /* 2nd char of ucfirst(U+DF) is 's' */ + CASE_CHANGE_BODY_START(~0, toUPPER_LC, _to_upper_title_latin1, 's'); + CASE_CHANGE_BODY_END (~0, CALL_TITLE_CASE); } /* @@ -2905,56 +2899,8 @@ Perl__to_utf8_lower_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, bool flags PERL_ARGS_ASSERT__TO_UTF8_LOWER_FLAGS; - if (flags) { - /* Treat a UTF-8 locale as not being in locale at all */ - if (IN_UTF8_CTYPE_LOCALE) { - flags = FALSE; - } - else { - _CHECK_AND_WARN_PROBLEMATIC_LOCALE; - } - } - - if (UTF8_IS_INVARIANT(*p)) { - if (flags) { - result = toLOWER_LC(*p); - } - else { - return to_lower_latin1(*p, ustrp, lenp); - } - } - else if UTF8_IS_DOWNGRADEABLE_START(*p) { - if (flags) { - U8 c = EIGHT_BIT_UTF8_TO_NATIVE(*p, *(p+1)); - result = toLOWER_LC(c); - } - else { - return to_lower_latin1(EIGHT_BIT_UTF8_TO_NATIVE(*p, *(p+1)), - ustrp, lenp); - } - } - else { /* UTF-8, ord above 255 */ - result = CALL_LOWER_CASE(valid_utf8_to_uvchr(p, NULL), p, ustrp, lenp); - - if (flags) { - result = check_locale_boundary_crossing(p, result, ustrp, lenp); - } - - return result; - } - - /* Here, used locale rules. Convert back to UTF-8 */ - if (UTF8_IS_INVARIANT(result)) { - *ustrp = (U8) result; - *lenp = 1; - } - else { - *ustrp = UTF8_EIGHT_BIT_HI((U8) result); - *(ustrp + 1) = UTF8_EIGHT_BIT_LO((U8) result); - *lenp = 2; - } - - return result; + CASE_CHANGE_BODY_START(~0, toLOWER_LC, to_lower_latin1, 0 /* 0 is dummy */) + CASE_CHANGE_BODY_END (~0, CALL_LOWER_CASE) } /* @@ -2986,38 +2932,10 @@ Perl__to_utf8_fold_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, U8 flags) assert(p != ustrp); /* Otherwise overwrites */ - if (flags & FOLD_FLAGS_LOCALE) { - /* Treat a UTF-8 locale as not being in locale at all */ - if (IN_UTF8_CTYPE_LOCALE) { - flags &= ~FOLD_FLAGS_LOCALE; - } - else { - _CHECK_AND_WARN_PROBLEMATIC_LOCALE; - } - } + CASE_CHANGE_BODY_START(FOLD_FLAGS_LOCALE, toFOLD_LC, _to_fold_latin1, + ((flags) & (FOLD_FLAGS_FULL | FOLD_FLAGS_NOMIX_ASCII))); - if (UTF8_IS_INVARIANT(*p)) { - if (flags & FOLD_FLAGS_LOCALE) { - result = toFOLD_LC(*p); - } - else { - return _to_fold_latin1(*p, ustrp, lenp, - flags & (FOLD_FLAGS_FULL | FOLD_FLAGS_NOMIX_ASCII)); - } - } - else if UTF8_IS_DOWNGRADEABLE_START(*p) { - if (flags & FOLD_FLAGS_LOCALE) { - U8 c = EIGHT_BIT_UTF8_TO_NATIVE(*p, *(p+1)); - result = toFOLD_LC(c); - } - else { - return _to_fold_latin1(EIGHT_BIT_UTF8_TO_NATIVE(*p, *(p+1)), - ustrp, lenp, - flags & (FOLD_FLAGS_FULL | FOLD_FLAGS_NOMIX_ASCII)); - } - } - else { /* UTF-8, ord above 255 */ - result = CALL_FOLD_CASE(valid_utf8_to_uvchr(p, NULL), p, ustrp, lenp, flags & FOLD_FLAGS_FULL); + result = CALL_FOLD_CASE(result, p, ustrp, lenp, flags & FOLD_FLAGS_FULL); if (flags & FOLD_FLAGS_LOCALE) { -- Perl5 Master Repository
