In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/42327f062c1a45d590c71a5bceb2682cc200be1e?hp=636fe681770e29874acb69ab162bc53731425867>
- Log -----------------------------------------------------------------
commit 42327f062c1a45d590c71a5bceb2682cc200be1e
Author: Karl Williamson <[email protected]>
Date:   Sat Oct 18 18:56:13 2014 -0600

    Document length-1 variable name rules

    This cleans up the existing documentation with added details, and
    documents the effect of recent changes.

M       pod/perldata.pod
M       pod/perldelta.pod

commit e92292576e58ce767c60c4cd8ebc1989792659ec
Author: Karl Williamson <[email protected]>
Date:   Sun Oct 19 23:03:44 2014 -0600

    Don't allow literal control chars in var names in EBCDIC

    Currently, a length-1 variable may have as its name some of the
    possible control characters, though this usage is deprecated.  It is a
    pain to make this work properly on EBCDIC, and since the use of these
    is deprecated, the pumpking agreed with me not to bother doing so.

M       t/lib/warnings/toke
M       t/uni/variables.t
M       toke.c

commit 4475d0d23c30e1ffbe123b0f5e3b800c0be35f4c
Author: Karl Williamson <[email protected]>
Date:   Sun Oct 19 10:35:04 2014 -0600

    Deprecate all length-1 non-graphic variable names

    v5.20 deprecated all otherwise-legal control characters as length-1
    variable names.  This extends that to include all non-graphic
    characters.  The practical effect of this occurs only when not under
    "use utf8", and affects just the C1 controls (code points 0x80 through
    0x9F), NO-BREAK SPACE, and SOFT HYPHEN.

M       pod/perldelta.pod
M       pod/perldiag.pod
M       t/uni/variables.t
M       toke.c

commit de1631081e2407e5021257de0eb1de20ad1534ca
Author: Karl Williamson <[email protected]>
Date:   Sun Oct 19 09:42:25 2014 -0600

    toke.c: Simplify macro for length-1 variable names

    It turns out that any non-NUL, non-space ASCII character that gets
    this far in the program is a valid length-1 variable name, so the
    expression in the macro that tests for legal ones can be simplified.
M       t/uni/variables.t
M       toke.c

commit 3027a07d5f585cc9ab5e9c0c08df9f73a758e86b
Author: Karl Williamson <[email protected]>
Date:   Sun Oct 19 09:20:44 2014 -0600

    t/uni/variables.t: Add tests for the generated warnings

M       t/uni/variables.t

commit 40c554ff2a344883f8b00708a6a832d79136bf1e
Author: Karl Williamson <[email protected]>
Date:   Sat Oct 18 19:00:01 2014 -0600

    t/uni/variables.t: White-space only

    Indent the newly-formed (by the previous commit) blocks.

M       t/uni/variables.t

commit fef6cdc52740801feb44b31268d429391c8e6103
Author: Karl Williamson <[email protected]>
Date:   Sat Oct 18 08:56:00 2014 -0600

    t/uni/variables.t: Add tests

    The length-1 variables in the ASCII range had not been systematically
    tested until this commit.

M       t/uni/variables.t

commit 2fb9f1434f6c6f7916e429be68140fc5d6b9eb31
Author: Karl Williamson <[email protected]>
Date:   Mon Oct 20 09:30:01 2014 -0600

    t/uni/variables.t: Add test skips

    A future commit will change the behavior of this test depending on
    whether it is run on an EBCDIC or an ASCII platform.  In the loop, add
    skips so the total number of tests run per iteration is the same no
    matter which branches (and tests) get run in the iteration.  This
    allows the ASCII and EBCDIC test plans to have the same number of
    tests, which in turn prevents platform-specific failures when
    adding/subtracting tests in this file.

M       t/uni/variables.t

commit 502bdc0fe4cdbb3e2e0b49893a622431a77ce409
Author: Karl Williamson <[email protected]>
Date:   Fri Oct 17 23:27:24 2014 -0600

    t/uni/variables.t: Rename variable

    A future commit will use the current variable name for a different
    purpose.

M       t/uni/variables.t

commit 048c5953b72547aea9b36024cb01c8605a016cca
Author: Karl Williamson <[email protected]>
Date:   Fri Oct 17 11:43:06 2014 -0600

    t/uni/variables.t: Clarify some test names

    I found it hard to read "use utf8" instead of 'use utf8', etc.  More
    importantly, this changes a bunch of tests so the character in related
    tests is named only in the first such test, with more description used.
    This is in preparation for a future commit.

M       t/uni/variables.t

commit d84210af634f1e1759ff6baaaf8c1c2fe656d6fe
Author: Karl Williamson <[email protected]>
Date:   Mon Apr 1 22:28:08 2013 -0600

    t/uni/variables.t: Fix to run under EBCDIC

M       t/uni/variables.t

commit f7d29c389f10807951c8c1a6b07cd11099e5c733
Author: Karl Williamson <[email protected]>
Date:   Fri Oct 17 22:54:21 2014 -0600

    t/uni/variables.t: Remove duplicate test

    The exact same test was repeated.

M       t/uni/variables.t

commit 77462865e9cbba7a19b6599367e59da6dc03a7a6
Author: Karl Williamson <[email protected]>
Date:   Fri Oct 17 11:50:13 2014 -0600

    t/uni/variables.t: Remove unused param to sprintf

M       t/uni/variables.t

commit 5333b64c49d92edaebcba2a634f2cb1d4edc5920
Author: Karl Williamson <[email protected]>
Date:   Thu Oct 16 22:24:13 2014 -0600

    toke.c: Fix potential buffer overflow

    Until this commit, malformed UTF-8 input could have caused a read off
    the end of the buffer here.

M       toke.c

commit aae773bbd74e39596c3ed0ed403f36b245b6d687
Author: Karl Williamson <[email protected]>
Date:   Thu Oct 16 22:03:30 2014 -0600

    perlebcdic: Add clarification

M       pod/perlebcdic.pod

commit 9415f65901e09c7e3366bca00059a439e82fa804
Author: Karl Williamson <[email protected]>
Date:   Wed Jun 26 15:40:53 2013 -0600

    uvoffuni_to_utf8_flags(): die if platform can't handle

    On non-EBCDIC platforms, currently any UV is encodable as UTF-8.
    (This would change if there were 128-bit words.)  Thus, much code
    assumes that nothing can go wrong when converting to UTF-8, and hence
    does no error checking.  However, UTF-EBCDIC is only capable of
    representing code points below 2**32, so if there are 64-bit words,
    this function can fail.

    Prior to this patch, there was no real overflow check, and garbage
    was returned by this function if it was called with too large a
    number.  While not ideal, the easiest thing to do is to just die for
    such a number, as we do for division by 0.  This involves changing
    only code within this function, and not its many callers.
M       pod/perldiag.pod
M       t/lib/warnings/utf8
M       utf8.c

commit a27ed9805e6767d92a9ba275f4379cd1f342cafe
Author: Karl Williamson <[email protected]>
Date:   Thu Oct 16 19:24:53 2014 -0600

    dquote_static.c: Forbid \c{ on EBCDIC

    Unlike on ASCII platforms, where \c{ evaluates to a semicolon and
    raises a fatal error because that result is not a control character,
    on EBCDIC platforms it does evaluate to a control.  But we have to
    forbid it anyway, because an unescaped "{" is now special as part of
    larger sequences, and we don't want to confuse the parser.

M       dquote_static.c
M       pod/perldiag.pod
M       t/lib/warnings/toke

commit d9759938eb6b6cd4b44f6be9200b1fdcce67382c
Author: Karl Williamson <[email protected]>
Date:   Thu Oct 16 19:44:07 2014 -0600

    utf8.h: EBCDIC fix

    These macros are supposed to accommodate inputs larger than a byte.
    Therefore, under EBCDIC, we have to use a different macro which
    handles the larger values.  On ASCII platforms the called macros are
    no-ops, so it doesn't matter there.

M       utf8.h

commit 88794300ed3b4c4d9b3ec9303c1b0c910d503082
Author: Karl Williamson <[email protected]>
Date:   Thu Oct 16 19:55:42 2014 -0600

    handy.h: Two EBCDIC fixes

    In EBCDIC-only macros, an argument previously failed to be
    dereferenced, and there was an extra '=='.  A few comment changes as
    well.

M       handy.h

commit 4d2de1058c946439368821b6622c0290c8e9f855
Author: Karl Williamson <[email protected]>
Date:   Thu Oct 16 20:45:17 2014 -0600

    lib/locale.t: Add some \Q

    More places in this file need \Q to quote things.  This was noticed
    in EBCDIC testing.

M       lib/locale.t

commit c46c4601920d32884d90fbc4c84b4947cb60f436
Author: Karl Williamson <[email protected]>
Date:   Wed Apr 17 21:47:41 2013 -0600

    t/re/regexp.t: Add ability to skip depending on platform

    This adds the capability to specify that a test is to be run only on
    an ASCII platform, or only on an EBCDIC one.
M       t/re/regexp.t

commit d6395ff9ae082229dda4b7ca25537f92c446bbab
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 16 15:49:06 2013 -0600

    t/re/regexp.t: Generalize for non-ASCII platforms

    This adds code to the processing of the tests in t/re/re_tests to
    automatically convert most character constants from the Unicode to
    the native character set.  This allows most tests in t/re/re_tests to
    be run on both platforms without change.  A later commit will add the
    capability to skip individual tests on the wrong platform, so the few
    tests that this commit doesn't work for can be accommodated.

M       t/re/regexp.t

commit 98c62be8f6a8b6ad8f0782b33467fabcfe191465
Author: Karl Williamson <[email protected]>
Date:   Mon Jul 7 16:59:19 2014 -0600

    charset_tools.pl: Fix changed function name

    The function has changed names, but this file was not updated to
    reflect that.

M       t/charset_tools.pl

commit 92c0a900a3381af22f6d14091dbaffaa47ad592c
Author: Karl Williamson <[email protected]>
Date:   Wed Oct 8 11:43:26 2014 -0600

    EBCDIC doesn't have real UTF-8 locales.

    At least not on the system that we have tested on.  There are locales
    that say they are UTF-8, but they're not; they're EBCDIC 1047.

M       locale.c
M       t/loc_tools.pl

commit 1ff128e487762041398e9cf94a21edf1833aa2ed
Author: Yaroslav Kuzmin <[email protected]>
Date:   Wed Jul 2 22:22:23 2014 -0600

    Fix ebcdic error in nostdio.h

M       nostdio.h

commit 3d915d2236450147aaf3ae3b16195dd74072c071
Author: Karl Williamson <[email protected]>
Date:   Wed Oct 8 11:41:22 2014 -0600

    regcomp.c: isASCII should be isASCII_uni

    It doesn't effectively matter, but isASCII_uni is the one that is
    supposed to be used for UVs.

M       regcomp.c

commit 0852beacbdeb373adc1e26bf890db2d8698ec0fb
Author: Karl Williamson <[email protected]>
Date:   Wed Oct 8 11:24:50 2014 -0600

    Fix isASCII for EBCDIC

    Prior to this commit, isASCII on EBCDIC platforms was defined as the
    isascii() libc function.  It turns out that this doesn't work
    properly.
    It needed to be this way back when EBCDIC was bootstrapped onto the
    target machine, but now various header files are furnished with the
    requisite definitions, so this is no longer necessary.

    The problem with isascii() is that, unlike on ASCII platforms, it is
    locale-dependent.  This means that instead of getting a standard
    ASCII definition, it returns whatever the underlying locale says,
    even if there is no 'use locale' anywhere in the program.

    Starting with this commit, the isASCII definition comes from the
    l1_char_class_tab.h file, which we know is accurate and not
    locale-dependent.

    This header can be used in compilations of utility programs where
    perl.h is not available.  For these, there are alternate, more
    complicated definitions, should they be needed in those utility
    programs.  Prior to this commit, several of those definitions also
    used locale-sensitive isfoo() functions.  The bulk of this commit
    refactors those definitions to avoid such functions as much as
    possible.  As noted in the added comments in the code, the one
    remaining use of such a function is only for the lesser-used control
    characters, which likely aren't used in the utility programs.

M       handy.h

commit caa94d3551e3f67e2b7157876e52748d055d30a4
Author: Karl Williamson <[email protected]>
Date:   Wed Oct 8 11:21:16 2014 -0600

    handy.h: Add missing macro

    This section of code is normally not compiled, but when circumstances
    call for it to be compiled, it may be missing the macro defined in
    this commit.  The macro is trivial on ASCII platforms, so just define
    it if missing.

M       handy.h

commit 6838b41e3cd58738e5e73705c9cef5bef6e28885
Author: Karl Williamson <[email protected]>
Date:   Wed Oct 8 11:16:59 2014 -0600

    handy.h: Need macro definition for normally non-compiled code

    This section of code is compiled only when perl.h is not available,
    i.e. for utility programs.  I periodically test that it still works,
    and this time a macro had been added to the other branch of the #if,
    but not this one.  This commit adds a trivial one to the missing
    area.

M       handy.h

commit 687c8d01ec65dc1466743a4045ad091451aedc41
Author: Karl Williamson <[email protected]>
Date:   Wed Oct 8 11:12:12 2014 -0600

    handy.h: Comments only

    Removes an obsolete comment, and adds text to make it easier to find
    the matching #else and #endif of a #if.

M       handy.h

commit ea5bc90f605c42b5ba11bf0aa5b9df6e7d9ec3e7
Author: Yaroslav Kuzmin <[email protected]>
Date:   Sun Sep 28 10:02:17 2014 -0600

    perlio: Fix to work with MVS Dataset

M       AUTHORS
M       perlio.c

commit 8a65147e419322d7c5ae5680e93886a7deb8b7c3
Author: Karl Williamson <[email protected]>
Date:   Sun Aug 24 10:12:49 2014 -0600

    mktables: Use chr() instead of pack()

    The input is the native code point, not the Unicode one, so it is a
    bug that pack("U") works.  The "W" format is what should be used, but
    I think chr() is easier to read.

M       lib/unicore/mktables

commit 60071a22d6a6c85bd3b817ebc160338978eee9bd
Author: Karl Williamson <[email protected]>
Date:   Sun Jun 22 21:29:28 2014 -0600

    utf8.c: Improve debug message

    This function was called with an empty string "" because that string
    was not actually needed in the function, except to better identify
    the source when there is an error.  So change the call to specify the
    actual source.

M       utf8.c

commit 5e2adf8597b4131da54f048b3d903feb3849164a
Author: Karl Williamson <[email protected]>
Date:   Mon Jun 9 12:28:10 2014 -0600

    regcomp.c: EBCDIC fix

M       regcomp.c

commit d11389730686d2620d5490fa96eb210ea5837773
Author: Karl Williamson <[email protected]>
Date:   Wed Feb 27 21:59:11 2013 -0700

    makedepend.SH: Split too-long lines; properly join

    I had thought that a continuation introduced a space.  But no, a
    continuation can happen in the middle of a token.  This commit also
    splits lines that are getting very long, to avoid preprocessor
    limitations.
M       makedepend.SH

commit 0e0580046e789498690a4323d54c9c0ba13a05c5
Author: Karl Williamson <[email protected]>
Date:   Wed Feb 27 15:51:28 2013 -0700

    makedepend.SH: White-space only

    Align continuation backslashes.

M       makedepend.SH

commit 558bc5733bc9dde03f4161b1179a8a49b7dbc432
Author: Karl Williamson <[email protected]>
Date:   Wed Feb 27 14:39:28 2013 -0700

    makedepend.SH: Remove some unnecessary white space

    Multi-line preprocessor directives are now joined into single lines.
    This can create lines too long for the preprocessor to handle.  This
    commit removes blanks adjoining comments that get deleted, which
    makes things somewhat less likely to exceed the limit.

    This commit also fixes several [] character classes which were meant
    to each match a tab or a blank, but editors had converted the tabs to
    blanks.

M       makedepend.SH

commit 86c3d61a7d00d9c0436b87cee87cb18a0055b6f8
Author: Karl Williamson <[email protected]>
Date:   Wed Feb 27 14:30:51 2013 -0700

    makedepend.SH: Retain '/**/' comments

    These comments may actually be necessary.

M       makedepend.SH

commit 6edd329f2057bbd709a169f99a8ea89d83e6b1ad
Author: Karl Williamson <[email protected]>
Date:   Tue Feb 26 09:00:18 2013 -0700

    makedepend.SH: Comment out unnecessary code

    This code currently causes problems on z/OS.  But since we don't know
    why it was there, I'm leaving it in as a placeholder.

M       makedepend.SH

commit 636a024b8c315521b8ddde9c3a7d31cea4e8d2f2
Author: Andy Dougherty <[email protected]>
Date:   Wed Feb 27 13:06:07 2013 -0500

    Disable gcc-style function attributes on z/OS.

    John Goodyear <[email protected]> reports that the z/OS C compiler
    supports the attribute keyword, but not exactly the same way as gcc.
    Instead of a "warning", the compiler emits an "INFORMATIONAL" message
    that Configure fails to detect.  Until Configure is fixed, just
    disable the attributes altogether.
    John Goodyear

M       hints/os390.sh

commit a4c2c75af551764534f9f8cb6afd4ef6ba514f45
Author: Andy Dougherty <[email protected]>
Date:   Wed Feb 27 09:12:13 2013 -0500

    Change os390 custom cppstdin script to use fgrep.

    Grep appears to be limited to 2048 characters, and truncates the
    output for cppstdin.  Fgrep apparently doesn't have that limit.

    Thanks to John Goodyear <[email protected]> for reporting this.

M       hints/os390.sh

-----------------------------------------------------------------------

Summary of changes:
 AUTHORS              |   1 +
 dquote_static.c      |  12 ++-
 handy.h              | 151 ++++++++++++++++++++----------
 hints/os390.sh       |  23 ++++-
 lib/locale.t         |  24 ++---
 lib/unicore/mktables |   9 +-
 locale.c             |   3 +
 makedepend.SH        |  64 +++++++------
 nostdio.h            |   4 +
 perlio.c             |  33 +++++++
 pod/perldata.pod     |  60 +++++++-----
 pod/perldelta.pod    |  18 +++-
 pod/perldiag.pod     |  32 ++++++-
 pod/perlebcdic.pod   |   4 +-
 regcomp.c            |   4 +-
 t/charset_tools.pl   |   2 +-
 t/lib/warnings/toke  |  33 ++++++-
 t/lib/warnings/utf8  |  18 +++-
 t/loc_tools.pl       |   4 +
 t/re/regexp.t        |  39 ++++++++
 t/uni/variables.t    | 257 +++++++++++++++++++++++++++++++++++++++------------
 toke.c               |  61 ++++++++----
 utf8.c               |  13 ++-
 utf8.h               |   4 +-
 24 files changed, 664 insertions(+), 209 deletions(-)

diff --git a/AUTHORS b/AUTHORS
index 4b059a6..482b2de 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -1229,6 +1229,7 @@ Wolfgang Laun <[email protected]>
 Wolfram Humann <[email protected]>
 Xavier Noria <[email protected]>
 YAMASHINA Hio <[email protected]>
+Yaroslav Kuzmin <[email protected]>
 Yary Hluchan
 Yasushi Nakajima <[email protected]>
 Yitzchak Scott-Thoennes <[email protected]>
diff --git a/dquote_static.c b/dquote_static.c
index 802d83b..5fe7f0b 100644
--- a/dquote_static.c
+++ b/dquote_static.c
@@ -50,10 +50,14 @@ S_grok_bslash_c(pTHX_ const char source, const bool output_warning)
                        "Character following \"\\c\" must be printable ASCII");
        }
        else if (source == '{') {
-           assert(isPRINT_A(toCTRL('{')));
-
-           /* diag_listed_as: Use "%s" instead of "%s" */
-           Perl_croak(aTHX_ "Use \"%c\" instead of \"\\c{\"", toCTRL('{'));
+           const char control = toCTRL('{');
+           if (isPRINT_A(control)) {
+               /* diag_listed_as: Use "%s" instead of "%s" */
+               Perl_croak(aTHX_ "Use \"%c\" instead of \"\\c{\"", control);
+           }
+           else {
+               Perl_croak(aTHX_ "Sequence \"\\c{\" invalid");
+           }
        }

        result = toCTRL(source);
diff --git a/handy.h b/handy.h
index 5e0c86e..445d495 100644
--- a/handy.h
+++ b/handy.h
@@ -275,6 +275,7 @@ typedef U64TYPE U64;
 #define TYPE_DIGITS(T)  BIT_DIGITS(sizeof(T) * 8)
 #define TYPE_CHARS(T)   (TYPE_DIGITS(T) + 2) /* sign, NUL */

+/* Unused by core; should be deprecated */
 #define Ctl(ch) ((ch) & 037)

 /* This is a helper macro to avoid preprocessor issues, replaced by nothing
@@ -893,27 +894,22 @@ patched there.  The file as of this writing is cpan/Devel-PPPort/parts/inc/misc

 #ifdef EBCDIC
 #   ifndef _ALL_SOURCE
-        /* This returns the wrong results on at least z/OS unless this is
-         * defined. */
+        /* The native libc isascii() et.al. functions return the wrong results
+         * on at least z/OS unless this is defined. */
 #       error _ALL_SOURCE should probably be defined
 #   endif
-
-    /* We could be called without perl.h, in which case NATIVE_TO_ASCII() is
-     * likely not defined, and so we use the native function */
-#   define isASCII(c)    cBOOL(isascii(c))
 #else
+    /* There is a simple definition of ASCII for ASCII platforms.  But the
+     * EBCDIC one isn't so simple, so is defined using table look-up like the
+     * other macros below */
 #   define isASCII(c)    ((WIDEST_UTYPE)(c) < 128)
 #endif

-#define isASCII_A(c)  isASCII(c)
-#define isASCII_L1(c) isASCII(c)
-
 /* The lower 3 bits in both the ASCII and EBCDIC representations of '0' are 0,
  * and the 8 possible permutations of those bits exactly comprise the 8 octal
  * digits */
 #define isOCTAL_A(c)  cBOOL(FITS_IN_8_BITS(c) && (0xF8 & (c)) == '0')

-/* ASCII range only */
 #ifdef H_PERL   /* If have access to perl.h, lookup in its table */

 /* Character class numbers.  For internal core Perl use only.  The ones less
@@ -1089,6 +1085,10 @@ EXTCONST U32 PL_charclass[];
 #   define isWORDCHAR_L1(c)  _generic_isCC(c, _CC_WORDCHAR)
 #   define isIDFIRST_L1(c)   _generic_isCC(c, _CC_IDFIRST)

+#   ifdef EBCDIC
+#       define isASCII(c)    _generic_isCC(c, _CC_ASCII)
+#   endif
+
 /* Participates in a single-character fold with a character above 255 */
 #   define _HAS_NONLATIN1_SIMPLE_FOLD_CLOSURE_ONLY_FOR_USE_BY_REGCOMP_DOT_C_AND_REGEXEC_DOT_C(c) ((! cBOOL(FITS_IN_8_BITS(c))) || (PL_charclass[(U8) (c)] & _CC_mask(_CC_NONLATIN1_SIMPLE_FOLD)))
@@ -1102,43 +1102,95 @@ EXTCONST U32 PL_charclass[];
                                _generic_isCC(c, _CC_IS_IN_SOME_FOLD)
 #   define _IS_MNEMONIC_CNTRL_ONLY_FOR_USE_BY_REGCOMP_DOT_C(c) \
                                _generic_isCC(c, _CC_MNEMONIC_CNTRL)
-#else   /* else we don't have perl.h */
+#else   /* else we don't have perl.h H_PERL */

 /* If we don't have perl.h, we are compiling a utility program.  Below we
  * hard-code various macro definitions that wouldn't otherwise be available
- * to it. */
-#   ifdef EBCDIC
-        /* Use the native functions.  They likely will return false for all
-         * non-ASCII values, but this makes sure */
-#       define isLOWER_A(c)  (isASCII(c) && islower(c))
-#       define isPRINT_A(c)  (isASCII(c) && isprint(c))
-#       define isUPPER_A(c)  (isASCII(c) && isupper(c))
-#   else   /* ASCII platform.  These are coded based on first principals */
+ * to it.  Most are coded based on first principals.  First some ones common
+ * to both ASCII and EBCDIC */
+#   define isDIGIT_A(c)  ((c) <= '9' && (c) >= '0')
+#   define isBLANK_A(c)  ((c) == ' ' || (c) == '\t')
+#   define isSPACE_A(c)  (isBLANK_A(c)                                     \
+                          || (c) == '\n'                                   \
+                          || (c) == '\r'                                   \
+                          || (c) == '\v'                                   \
+                          || (c) == '\f')
+#   ifdef EBCDIC    /* There are gaps between 'i' and 'j'; 'r' and 's'.  Same
+                       for uppercase.  This is ordered to exclude most things
+                       early */
+#       define isLOWER_A(c)  ((c) >= 'a' && (c) <= 'z'                     \
+                              && ((c) <= 'i'                               \
+                                  || ((c) >= 'j' && (c) <= 'r')            \
+                                  || (c) >= 's'))
+#       define isUPPER_A(c)  ((c) >= 'A' && (c) <= 'Z'                     \
+                              && ((c) <= 'I'                               \
+                                  || ((c) >= 'J' && (c) <= 'R')            \
+                                  || (c) >= 'S'))
+#   else   /* ASCII platform. */
 #       define isLOWER_A(c)  ((c) >= 'a' && (c) <= 'z')
-#       define isPRINT_A(c)  (((c) >= 32 && (c) < 127))
 #       define isUPPER_A(c)  ((c) <= 'Z' && (c) >= 'A')
-#   endif   /* Below are common definitions for ASCII and non-ASCII */
+#   endif
+
+    /* Some more ASCII, non-ASCII common definitions */
 #   define isALPHA_A(c)  (isUPPER_A(c) || isLOWER_A(c))
 #   define isALPHANUMERIC_A(c) (isALPHA_A(c) || isDIGIT_A(c))
-#   define isBLANK_A(c)  ((c) == ' ' || (c) == '\t')
-#   define isCNTRL_A(c)  (isASCII(c) && (! isPRINT_A(c)))
-#   define isDIGIT_A(c)  ((c) <= '9' && (c) >= '0')
-#   define isGRAPH_A(c)  (isPRINT_A(c) && (c) != ' ')
-#   define isIDFIRST_A(c) (isALPHA_A(c) || (c) == '_')
-#   define isPUNCT_A(c)  (isGRAPH_A(c) && (! isALPHANUMERIC_A(c)))
-#   define isSPACE_A(c)  ((c) == ' '                                       \
-                          || (c) == '\t'                                   \
-                          || (c) == '\n'                                   \
-                          || (c) == '\r'                                   \
-                          || (c) == '\v'                                   \
-                          || (c) == '\f')
 #   define isWORDCHAR_A(c) (isALPHANUMERIC_A(c) || (c) == '_')
-#   define isXDIGIT_A(c) (isDIGIT_A(c)                                     \
-                          || ((c) >= 'a' && (c) <= 'f')                    \
+#   define isIDFIRST_A(c) (isALPHA_A(c) || (c) == '_')
+#   define isXDIGIT_A(c)  (isDIGIT_A(c)                                    \
+                           || ((c) >= 'a' && (c) <= 'f')                   \
                           || ((c) <= 'F' && (c) >= 'A'))
+#   ifdef EBCDIC
+#       define isPUNCT_A(c)  ((c) == '-' || (c) == '!' || (c) == '"'       \
+                              || (c) == '#' || (c) == '$' || (c) == '%'    \
+                              || (c) == '&' || (c) == '\'' || (c) == '('   \
+                              || (c) == ')' || (c) == '*' || (c) == '+'    \
+                              || (c) == ',' || (c) == '.' || (c) == '/'    \
+                              || (c) == ':' || (c) == ';' || (c) == '<'    \
+                              || (c) == '=' || (c) == '>' || (c) == '?'    \
+                              || (c) == '@' || (c) == '[' || (c) == '\\'   \
+                              || (c) == ']' || (c) == '^' || (c) == '_'    \
+                              || (c) == '`' || (c) == '{' || (c) == '|'    \
+                              || (c) == '}' || (c) == '~')
+#       define isGRAPH_A(c)  (isALPHANUMERIC_A(c) || isPUNCT_A(c))
+#       define isPRINT_A(c)  (isGRAPH_A(c) || (c) == ' ')
+
+#       ifdef QUESTION_MARK_CTRL
+#           define _isQMC(c)  ((c) == QUESTION_MARK_CTRL)
+#       else
+#           define _isQMC(c)  0
+#       endif
+
+        /* I (khw) can't think of a way to define all the ASCII controls
+         * without resorting to a libc (locale-sensitive) call.  But we know
+         * that all controls but the question-mark one are in the range
+         * 0-0x3f.  This makes sure that all the controls that have names are
+         * included, and all controls that are also considered ASCII in the
+         * locale.  This may include more or fewer than what it actually
+         * should, but the wrong ones are less-important controls, so likely
+         * won't impact things (keep in mind that this is compiled only if
+         * perl.h isn't available).  The question mark control is included if
+         * available */
+#       define isCNTRL_A(c)  (((c) < 0x40 && isascii(c))                   \
+                              || (c) == '\0' || (c) == '\a' || (c) == '\b' \
+                              || (c) == '\f' || (c) == '\n' || (c) == '\r' \
+                              || (c) == '\t' || (c) == '\v' || _isQMC(c))
+
+#       define isASCII(c)    (isCNTRL_A(c) || isPRINT_A(c))
+#   else   /* ASCII platform; things are simpler, and isASCII has already
+              been defined */
+#       define isGRAPH_A(c)  (((c) > ' ' && (c) < 127))
+#       define isPRINT_A(c)  (isGRAPH_A(c) || (c) == ' ')
+#       define isPUNCT_A(c)  (isGRAPH_A(c) && (! isALPHANUMERIC_A(c)))
+#       define isCNTRL_A(c)  (isASCII(c) && (! isPRINT_A(c)))
+#   endif
+
 /* The _L1 macros may be unnecessary for the utilities; I (khw) added them
- * during debugging, and it seems best to keep them. */
+ * during debugging, and it seems best to keep them.  We may be called
+ * without NATIVE_TO_LATIN1 being defined.  On ASCII platforms, it doesn't
+ * do anything anyway, so make it not a problem */
+#   if ! defined(EBCDIC) && ! defined(NATIVE_TO_LATIN1)
+#       define NATIVE_TO_LATIN1(ch) (ch)
+#   endif
 #   define isPSXSPC_A(c)  isSPACE_A(c) /* XXX Assumes SPACE matches '\v' */
 #   define isALPHA_L1(c)  (isUPPER_L1(c) || isLOWER_L1(c))
 #   define isALPHANUMERIC_L1(c) (isALPHA_L1(c) || isDIGIT_A(c))
@@ -1192,6 +1244,7 @@ EXTCONST U32 PL_charclass[];
 /* And these aren't accurate at all.  They are useful only for above
  * Latin1, which utilities and bootstrapping don't deal with */
 #   define _IS_NON_FINAL_FOLD_ONLY_FOR_USE_BY_REGCOMP_DOT_C(c) 0
+#   define _HAS_NONLATIN1_SIMPLE_FOLD_CLOSURE_ONLY_FOR_USE_BY_REGCOMP_DOT_C_AND_REGEXEC_DOT_C(c) 0
 #   define _HAS_NONLATIN1_FOLD_CLOSURE_ONLY_FOR_USE_BY_REGCOMP_DOT_C_AND_REGEXEC_DOT_C(c) 0

 /* Many of the macros later in this file are defined in terms of these.  By
@@ -1205,10 +1258,12 @@ EXTCONST U32 PL_charclass[];
            (FITS_IN_8_BITS(c) && S_bootstrap_ctype((U8) (c), (classnum), TRUE))
 #   define _generic_isCC_A(c, classnum) \
            (FITS_IN_8_BITS(c) && S_bootstrap_ctype((U8) (c), (classnum), FALSE))
-#endif /* End of no perl.h */
+#endif /* End of no perl.h H_PERL */

 #define isALPHANUMERIC(c)  isALPHANUMERIC_A(c)
 #define isALPHA(c)         isALPHA_A(c)
+#define isASCII_A(c)       isASCII(c)
+#define isASCII_L1(c)      isASCII(c)
 #define isBLANK(c)         isBLANK_A(c)
 #define isCNTRL(c)         isCNTRL_A(c)
 #define isDIGIT(c)         isDIGIT_A(c)
@@ -1559,7 +1614,7 @@ EXTCONST U32 PL_charclass[];
 #ifdef EBCDIC
     /* Because all controls are UTF-8 invariants in EBCDIC, we can use this
      * more efficient macro instead of the more general one */
-#   define isCNTRL_utf8(p)  isCNTRL_L1(p)
+#   define isCNTRL_utf8(p)  isCNTRL_L1(*(p))
 #else
 #   define isCNTRL_utf8(p)  _generic_utf8(_CC_CNTRL, p, 0)
 #endif
@@ -1665,17 +1720,19 @@ EXTCONST U32 PL_charclass[];
  * the outlier from the block that contains the other controls, just like
  * toCTRL('?') on ASCII yields DEL, the control that is the outlier from the C0
  * block.  If it weren't special cased, it would yield a non-control.
- * The conversion works both ways, so CTRL('D') is 4, and CTRL(4) is D, etc. */
+ * The conversion works both ways, so toCTRL('D') is 4, and toCTRL(4) is D,
+ * etc. */
 #ifndef EBCDIC
-#  define toCTRL(c)    (toUPPER(c) ^ 64)
+#  define toCTRL(c)    (__ASSERT_(FITS_IN_8_BITS(c)) toUPPER(c) ^ 64)
 #else
-#  define toCTRL(c)    ((isPRINT_A(c))                                 \
-                        ? UNLIKELY((c) == '?')                         \
-                          ? QUESTION_MARK_CTRL                         \
-                          : (NATIVE_TO_LATIN1(toUPPER(c)) ^ 64)        \
-                        : UNLIKELY((c) == QUESTION_MARK_CTRL)          \
-                          ? ((c) == '?')                               \
-                          : (LATIN1_TO_NATIVE((c) ^ 64)))
+#  define toCTRL(c)    (__ASSERT_(FITS_IN_8_BITS(c))                   \
+                        ((isPRINT_A(c))                                \
+                         ? (UNLIKELY((c) == '?')                       \
+                            ? QUESTION_MARK_CTRL                       \
+                            : (NATIVE_TO_LATIN1(toUPPER(c)) ^ 64))     \
+                         : (UNLIKELY((c) == QUESTION_MARK_CTRL)        \
+                            ? '?'                                      \
+                            : (LATIN1_TO_NATIVE((c) ^ 64)))))
 #endif

 /* Line numbers are unsigned, 32 bits. */
diff --git a/hints/os390.sh b/hints/os390.sh
index 5aafb4e..d9b0f8a 100644
--- a/hints/os390.sh
+++ b/hints/os390.sh
@@ -158,10 +158,10 @@ esac
 # under a compiler other than c89.
 case "$usedl" in
 define)
-echo 'cat >.$$.c; '"$cc"' -D_OE_SOCKETS -D_XOPEN_SOURCE_EXTENDED -D_ALL_SOURCE -D_SHR_ENVIRON -E -Wc,"LANGLVL(DOLLARINNAMES)",NOLOC ${1+"$@"} .$$.c | grep -v "??="; rm .$$.c' > cppstdin
+echo 'cat >.$$.c; '"$cc"' -D_OE_SOCKETS -D_XOPEN_SOURCE_EXTENDED -D_ALL_SOURCE -D_SHR_ENVIRON -E -Wc,"LANGLVL(DOLLARINNAMES)",NOLOC ${1+"$@"} .$$.c | fgrep -v "??="; rm .$$.c' > cppstdin
 ;;
 *)
-echo 'cat >.$$.c; '"$cc"' -D_OE_SOCKETS -D_XOPEN_SOURCE_EXTENDED -D_ALL_SOURCE -E -Wc,"LANGLVL(DOLLARINNAMES)",NOLOC ${1+"$@"} .$$.c | grep -v "??="; rm .$$.c' > cppstdin
+echo 'cat >.$$.c; '"$cc"' -D_OE_SOCKETS -D_XOPEN_SOURCE_EXTENDED -D_ALL_SOURCE -E -Wc,"LANGLVL(DOLLARINNAMES)",NOLOC ${1+"$@"} .$$.c | fgrep -v "??="; rm .$$.c' > cppstdin
 ;;
 esac
@@ -231,3 +231,22 @@
 d_gethostbyaddr_r='undef'
 d_gethostbyname_r='undef'
 d_gethostent_r='undef'
+# The z/OS C compiler compiler supports the attribute keyword, but in a
+# limited manner.
+#
+# Ideally, Configure's tests should test the attributes as they are expected
+# to be used in perl, and, ideally, those tests would fail on z/OS.
+# Until then, just tell Configure to ignore the attributes.  Currently,
+# Configure thinks attributes are supported because it does not recognize
+# warning messages like this:
+#
+# INFORMATIONAL CCN4108 ./proto.h:4534  The use of keyword '__attribute__' is non-portable.
+
+d_attribute_deprecated='undef'
+d_attribute_format='undef'
+d_attribute_malloc='undef'
+d_attribute_nonnull='undef'
+d_attribute_noreturn='undef'
+d_attribute_pure='undef'
+d_attribute_unused='undef'
+d_attribute_warn_unused_result='undef'
diff --git a/lib/locale.t b/lib/locale.t
index f59e17b..54dbd38 100644
--- a/lib/locale.t
+++ b/lib/locale.t
@@ -2075,10 +2075,10 @@ foreach my $Locale (@Locale) {
                          "; lc=", disp_chars(($y)), "; ",
                          "; fc=", disp_chars((fc $x)), "; ",
                          disp_chars(($x)), "=~/", disp_chars(($y)), "/i=",
-                         $x =~ /$y/i ? 1 : 0,
+                         $x =~ /\Q$y/i ? 1 : 0,
                          "; ",
                          disp_chars(($y)), "=~/", disp_chars(($x)), "/i=",
-                         $y =~ /$x/i ? 1 : 0,
+                         $y =~ /\Q$x/i ? 1 : 0,
                          "\n");
            #
            # If $x and $y contain regular expression characters
@@ -2108,7 +2108,7 @@ foreach my $Locale (@Locale) {
                    print "# Regex characters in '$x' or '$y', skipping test $locales_test_number for locale '$Locale'\n";
                    next;
                }
-               push @f, $x unless $x =~ /$y/i && $y =~ /$x/i;
+               push @f, $x unless $x =~ /\Q$y/i && $y =~ /\Q$x/i;

                # fc is not a locale concept, so Perl uses lc for it.
                push @f, $x unless lc $x eq fc $x;
@@ -2121,13 +2121,13 @@ foreach my $Locale (@Locale) {
                          "; lc=", disp_chars(($y)), "; ",
                          "; fc=", disp_chars((fc $x)), "; ",
                          disp_chars(($x)), "=~/", disp_chars(($y)), "/i=",
-                         $x =~ /$y/i ? 1 : 0,
+                         $x =~ /\Q$y/i ? 1 : 0,
                          "; ",
                          disp_chars(($y)), "=~/", disp_chars(($x)), "/i=",
-                         $y =~ /$x/i ? 1 : 0,
+                         $y =~ /\Q$x/i ? 1 : 0,
                          "\n");
-               push @f, $x unless $x =~ /$y/i && $y =~ /$x/i;
+               push @f, $x unless $x =~ /\Q$y/i && $y =~ /\Q$x/i;

                # The places where Unicode's lc is different from fc are
                # skipped here by virtue of the 'next unless uc...' line above
@@ -2143,16 +2143,16 @@ foreach my $Locale (@Locale) {
                          "; uc=", disp_chars(($y)), "; ",
                          "; fc=", disp_chars((fc $x)), "; ",
                          disp_chars(($x)), "=~/", disp_chars(($y)), "/i=",
-                         $x =~ /$y/i ? 1 : 0,
+                         $x =~ /\Q$y/i ? 1 : 0,
                          "; ",
                          disp_chars(($y)), "=~/", disp_chars(($x)), "/i=",
-                         $y =~ /$x/i ? 1 : 0,
+                         $y =~ /\Q$x/i ? 1 : 0,
                          "\n");
                if ($x =~ $re || $y =~ $re) { # See above.
                    print "# Regex characters in '$x' or '$y', skipping test $locales_test_number for locale '$Locale'\n";
                    next;
                }
-               push @f, $x unless $x =~ /$y/i && $y =~ /$x/i;
+               push @f, $x unless $x =~ /\Q$y/i && $y =~ /\Q$x/i;
                push @f, $x unless lc $x eq fc $x;
            }
@@ -2164,12 +2164,12 @@ foreach my $Locale (@Locale) {
                          "; uc=", disp_chars(($y)), "; ",
                          "; fc=", disp_chars((fc $x)), "; ",
                          disp_chars(($x)), "=~/", disp_chars(($y)), "/i=",
-                         $x =~ /$y/i ? 1 : 0,
+                         $x =~ /\Q$y/i ? 1 : 0,
                          "; ",
                          disp_chars(($y)), "=~/", disp_chars(($x)), "/i=",
-                         $y =~ /$x/i ? 1 : 0,
+                         $y =~ /\Q$x/i ? 1 : 0,
                          "\n");
-               push @f, $x unless $x =~ /$y/i && $y =~ /$x/i;
+               push @f, $x unless $x =~ /\Q$y/i && $y =~ /\Q$x/i;
                push @f, $x unless lc $x eq fc $x;
            }
diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index ffbfe74..a21aa1a 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -6999,11 +6999,14 @@ END
                }
            }

-           # I (khw) have never waded through this line to
-           # understand it well enough to comment it.
+           # The unpack yields a list of the bytes that comprise the
+           # UTF-8 of $code_point, which are each placed in \xZZ format
+           # and output in the %s to map to $tostr, so the result looks
+           # like:
+           #     "\xC4\xB0" => "\x{0069}\x{0307}",
            my $utf8 = sprintf(qq["%s" => "$tostr",],
                                join("", map { sprintf "\\x%02X", $_ }
-                                unpack("U0C*", pack("U", $code_point))));
+                                unpack("U0C*", chr $code_point)));

            # Add a comment so that a human reader can more easily
            # see what's going on.
diff --git a/locale.c b/locale.c
index 2e68b23..0bf234c 100644
--- a/locale.c
+++ b/locale.c
@@ -1425,6 +1425,8 @@ Perl__is_cur_LC_category_utf8(pTHX_ int category)

 #endif /* the code that is compiled when no nl_langinfo */

+#ifndef EBCDIC /* On os390, even if the name ends with "UTF-8', it isn't a
+                  UTF-8 locale */
    /* As a last resort, look at the locale name to see if it matches
     * qr/UTF -?
     * 8 /ix, or some other common locale names.  This "name", the
     * return of setlocale(), is actually defined to be opaque, so we can't
@@ -1464,6 +1466,7 @@ Perl__is_cur_LC_category_utf8(pTHX_ int category)
                          "Locale %s doesn't end with UTF-8 in name\n",
                          save_input_locale));
    }
+#endif

 #ifdef WIN32
    /* http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756.aspx */
diff --git a/makedepend.SH b/makedepend.SH
index f992af3..9870c3f 100755
--- a/makedepend.SH
+++ b/makedepend.SH
@@ -141,34 +141,46 @@ for file in `$cat .clist`; do
    #      && defined(BAR) /* comment */ \
    #      && defined(BAZ) /* comment */ \
    # etc.
-   # This code processes these latter situations first; it assumes there is
-   # at most one straightforward comment per continued preprocessor line. (It
-   # would be easier to handle more general cases if sed had a non-greedy '*'
-   # quantifier; but typically preprocessor directive lines are rather
-   # simple.)  The continuation line is joined, and the process repeated on
-   # the enlarged line as long as there are continuations.  At the end, if
-   # there are any comments remaining, they should be like the first situation,
-   # and can just be deleted.  (Subsequent lines of the comment are irrelevant
-   # and get dropped.)
-   ( $echo "#line 2 \"$file\"";       \
-     $sed -n <$file                   \
-       -e "/^${filebase}_init(/q"     \
-       -e ': testcont'                \
-       -e '/^[ ]*#/s|/\*.*\*/||'      \
-       -e '/\\$/{'                    \
-       -e 'N'                         \
-       -e 'b testcont'                \
-       -e '}'                         \
-       -e 's/\\\n/ /g'                \
-       -e '/^#line/d'                 \
-       -e '/^[ ]*#/{'                 \
-       -e 's|/\*.*$||'                \
-       -e p                           \
+   # Also, in lines like
+   #    #defined FOO(a,b) a/**/b
+   # the comment may be important and so needs to be retained.
+   # This code processes the single-line comments first; it assumes there is
+   # at most one straightforward comment per continued preprocessor line,
+   # replacing each non-empty comment (and its surrounding white space) by a
+   # single space.  (sed only has a greedy '*' quantifier, so this doesn't
+   # work right if there are multiple comments per line, and strings can look
+   # like comments to it; both are unlikely in a preprocessor statement.)  Any
+   # continuation line is joined, and the process repeated on the enlarged
+   # line as long as there are continuations.  At the end, if there are any
+   # comments remaining, they are either completely empty or are like the
+   # first situation.  The latter are just deleted by first deleting to the
+   # end of line (including preceding white space) things that start with '/*'
+   # and the next char isn't a '*'; then things that start with '/**', but the
+   # next char isn't a '/'.  (Subsequent lines of the comment are irrelevant
+   # and get dropped.)
At the end, we unjoin very long lines to avoid + # preprocessor limitations + ( $echo "#line 2 \"$file\""; \ + $sed -n <$file \ + -e "/^${filebase}_init(/q" \ + -e ': testcont' \ + -e '/^[ ]*#/s|[ ]*/\*..*\*/[ ]*| |' \ + -e '/\\$/{' \ + -e 'N' \ + -e 'b testcont' \ + -e '}' \ + -e 's/\\\n//g' \ + -e '/^#line/d' \ + -e '/^[ ]*#/{' \ + -e 's|[ ]*/\*[^*].*$||' \ + -e 's|[ ]*/\*\*[^/].*$||' \ + -e 's/.\{255\}/&\\\n/g' \ + -e p \ -e '}' ) >UU/$file.c - if [ "$osname" = os390 -a "$file" = perly.c ]; then - $echo '#endif' >>UU/$file.c - fi + # We're not sure why this was there; the #endif is extraneous on modern z/OS + #if [ "$osname" = os390 -a "$file" = perly.c ]; then + # $echo '#endif' >>UU/$file.c + #fi if [ "$osname" = os390 ]; then $cppstdin $finc -I. $cppflags $cppminus <UU/$file.c | diff --git a/nostdio.h b/nostdio.h index ef8d652..c815fd8 100644 --- a/nostdio.h +++ b/nostdio.h @@ -25,6 +25,8 @@ struct _FILE; #define FILE struct _FILE #endif +#ifndef EBCDIC + #define _CANNOT "CANNOT" #undef clearerr @@ -125,6 +127,8 @@ struct _FILE; #define pclose(f) _CANNOT _pclose_ #endif +#endif /*not define EBCDIC */ + /* * Local variables: * c-indentation-style: bsd diff --git a/perlio.c b/perlio.c index 6c742d2..19b73ab 100644 --- a/perlio.c +++ b/perlio.c @@ -2930,11 +2930,27 @@ PerlIO_importFILE(FILE *stdio, const char *mode) { dTHX; PerlIO *f = NULL; +#ifdef EBCDIC + int rc; + char filename[FILENAME_MAX]; + fldata_t fileinfo; +#endif if (stdio) { PerlIOStdio *s; int fd0 = fileno(stdio); if (fd0 < 0) { +#ifdef EBCDIC + rc = fldata(stdio,filename,&fileinfo); + if(rc != 0){ + return NULL; + } + if(fileinfo.__dsorgHFS){ + return NULL; + } + /*This MVS dataset , OK!*/ +#else return NULL; +#endif } if (!mode || !*mode) { /* We need to probe to see how we can open the stream @@ -2966,7 +2982,24 @@ PerlIO_importFILE(FILE *stdio, const char *mode) if ((f = PerlIO_push(aTHX_(PerlIO_allocate(aTHX)), PERLIO_FUNCS_CAST(&PerlIO_stdio), mode, NULL))) { s = PerlIOSelf(f, PerlIOStdio); 
s->stdio = stdio; +#ifdef EBCDIC + fd0 = fileno(stdio); + if(fd0 != -1){ + PerlIOUnix_refcnt_inc(fd0); + } + else{ + rc = fldata(stdio,filename,&fileinfo); + if(rc != 0){ + PerlIOUnix_refcnt_inc(fd0); + } + if(fileinfo.__dsorgHFS){ + PerlIOUnix_refcnt_inc(fd0); + } + /*This MVS dataset , OK!*/ + } +#else PerlIOUnix_refcnt_inc(fileno(stdio)); +#endif } } return f; diff --git a/pod/perldata.pod b/pod/perldata.pod index c490b63..b0f5e7e 100644 --- a/pod/perldata.pod +++ b/pod/perldata.pod @@ -188,31 +188,49 @@ fully-qualified. They come in four forms: =over -=item A sigil, followed solely by digits matching \p{POSIX_Digit}, like C<$0>, -C<$1>, or C<$10000>. - -=item A sigil, followed by either a caret and a single POSIX uppercase letter, -like C<$^V> or C<$^W>, or a sigil followed by a literal control character -matching the C<\p{POSIX_Cntrl}> property. -Due to a historical oddity, if not -running under C<use utf8>, the 128 extra controls in the C<[0x80-0xff]> range -may also be used in length one variables. The use of a literal control -character is deprecated. Support for this form will be removed in a future -version of perl. - -=item Similar to the above, a sigil, followed by bareword text in brackets, -where the first character is either a caret followed by an uppercase letter, -or a literal control, like C<${^GLOBAL_PHASE}> or C<${\7LOBAL_PHASE}>. The use -of a literal control character is deprecated. Support for this form will be -removed in a future version of perl. - -=item A sigil followed by a single character matching the C<\p{POSIX_Punct}> -property, like C<$!> or C<%+>. +=item * + +A sigil, followed solely by digits matching C<\p{POSIX_Digit}>, like +C<$0>, C<$1>, or C<$10000>. + +=item * + +A sigil, followed by either a caret and a single POSIX uppercase letter, +like C<$^V> or C<$^W>, or a sigil followed by a literal non-space, +non-C<NUL> control character matching the C<\p{POSIX_Cntrl}> property. 
+Due to a historical oddity, if not running under C<use utf8>, the 128 +characters in the C<[0x80-0xff]> range are considered to be controls, +and may also be used in length-one variables. However, the use of +non-graphical characters is deprecated as of v5.22, and support for them +will be removed in a future version of perl. ASCII space characters and +C<NUL> already aren't allowed, so this means that a single-character +variable name with that name being any other C0 control C<[0x01-0x1F]>, +or C<DEL> will generate a deprecated warning. Already, under C<"use +utf8">, non-ASCII characters must match C<Perl_XIDS>. As of v5.22, when +not under C<"use utf8">, C1 controls C<[0x80-0x9F]>, NO-BREAK SPACE, and +SOFT HYPHEN (C<SHY>) generate a deprecated warning. + +=item * + +Similar to the above, a sigil, followed by bareword text in brackets, +where the first character is either a caret followed by an uppercase +letter, like C<${^GLOBAL_PHASE}> or a non-C<NUL>, non-space literal +control like C<${\7LOBAL_PHASE}>. Like the above, when not under +C<"use utf8">, the characters in C<[0x80-0xFF]> are considered controls, but as +of v5.22, the use of any that are non-graphical is deprecated, and as +of v5.20 the use of any ASCII-range literal control is deprecated. +Support for these will be removed in a future version of perl. + +=item * + +A sigil followed by a single character matching the C<\p{POSIX_Punct}> +property, like C<$!> or C<%+>, except the character C<"{"> doesn't work. =back Note that as of Perl 5.20, literal control characters in variable names -are deprecated. +are deprecated; and as of Perl 5.22, any other non-graphic characters +are also deprecated. =head2 Context X<context> X<scalar context> X<list context> diff --git a/pod/perldelta.pod b/pod/perldelta.pod index 2c611ca..8e9aa2d 100644 --- a/pod/perldelta.pod +++ b/pod/perldelta.pod @@ -78,6 +78,17 @@ as an updated module in the L</Modules and Pragmata> section.
[ List each other deprecation as a =head2 entry ] +=head2 Use of non-graphic characters in single-character variable names + +The syntax for single-character variable names is more lenient than +for longer variable names, allowing the one-character name to be a +punctuation character or even invisible (a non-graphic). Perl v5.20 +deprecated the ASCII-range controls as such a name. Now, all +non-graphic characters that formerly were allowed are deprecated. +The practical effect of this occurs only when not under C<S<"use +utf8">>, and affects just the C1 controls (code points 0x80 through +0xFF), NO-BREAK SPACE, and SOFT HYPHEN. + =head1 Performance Enhancements XXX Changes which enhance performance without changing behaviour go here. @@ -152,13 +163,14 @@ XXX Changes which significantly change existing files in F<pod/> go here. However, any changes to F<pod/perldiag.pod> should go in the L</Diagnostics> section. -=head3 L<XXX> +=head3 L<perldata/Identifier parsing> =over 4 =item * -XXX Description of the change here +The syntax of single-character variable names has been brought +up-to-date and more fully explained. =back @@ -192,7 +204,7 @@ XXX L<message|perldiag/"message"> =item * -XXX L<message|perldiag/"message"> +L<Use of literal non-graphic characters in variable names is deprecated|perldiag/"Use of literal non-graphic characters in variable names is deprecated"> =back diff --git a/pod/perldiag.pod b/pod/perldiag.pod index 863caf1..fbfdb93 100644 --- a/pod/perldiag.pod +++ b/pod/perldiag.pod @@ -1181,6 +1181,13 @@ probably because you don't have write permission to the directory. (P) An error peculiar to VMS. Perl thought stdin was a pipe, and tried to reopen it to accept binary data. Alas, it failed. +=item Can't represent character for Ox%X on this platform + +(F) There is a hard limit to how big a character code point can be due +to the fundamental properties of UTF-8, especially on EBCDIC +platforms. The given code point exceeds that. 
The only work-around is +to not use such a large code point. + =item Can't reset %ENV on this system (F) You called C<reset('E')> or similar, which tried to reset @@ -5056,6 +5063,14 @@ missing the final closing quote or angle bracket after the name. The S<<-- HERE> shows whereabouts in the regular expression the problem was discovered. +=item Sequence "\c{" invalid + +(F) These three characters may not appear in sequence in a +double-quotish context. This message is raised only on non-ASCII +platforms (a different error message is output on ASCII ones). If you +were intending to specify a control character with this sequence, you'll +have to use a different way to specify it. + =item Sequence \%s... not terminated in regex; marked by S<<-- HERE> in m/%s/ @@ -6517,10 +6532,19 @@ old way has bad side effects. =item Use of literal control characters in variable names is deprecated -(D deprecated) Using literal control characters in the source to refer -to the ^FOO variables, like C<$^X> and C<${^GLOBAL_PHASE}> is now -deprecated. This only affects code like C<$\cT>, where \cT is a control in -the source code: C<${"\cT"}> and C<$^T> remain valid. +=item Use of literal non-graphic characters in variable names is deprecated + +(D deprecated) Using literal non-graphic (including control) characters +in the source to refer to the ^FOO variables, like C<$^X> and +C<${^GLOBAL_PHASE}> is now deprecated. (We use C<^X> and C<^G> here for +legibility. They actually represent the non-printable control +characters, code points 0x18 and 0x07, respectively; C<^A> would mean +the control character whose code point is 0x01.) This only affects code +like +C<$\cT>, where C<\cT> is a control in the source code; C<${"\cT"}> and +C<$^T> remain valid. Things that are non-controls and also not graphic +are NO-BREAK SPACE and SOFT HYPHEN, which were previously only allowed +for historical reasons. 
=item Use of -l on filehandle%s diff --git a/pod/perlebcdic.pod b/pod/perlebcdic.pod index 45a6e54..0a99be8 100644 --- a/pod/perlebcdic.pod +++ b/pod/perlebcdic.pod @@ -799,7 +799,9 @@ or regex, as it will absorb the terminator. But C<\c\I<X>> is a C<FILE SEPARATOR> concatenated with I<X> for all I<X>. The outlier C<\c?> on ASCII, which yields a non-C0 control C<DEL>, yields the outlier control C<APC> on EBCDIC, the one that isn't in the -block of contiguous controls. +block of contiguous controls. Note that a subtlety of this is that +C<\c?> on ASCII platforms is an ASCII character, while it isn't +equivalent to any ASCII character in EBCDIC platforms. chr ord 8859-1 0037 1047 && POSIX-BC ----------------------------------------------------------------------- diff --git a/regcomp.c b/regcomp.c index 5fe3c9c..a62e3e1 100644 --- a/regcomp.c +++ b/regcomp.c @@ -11364,7 +11364,7 @@ S_alloc_maybe_populate_EXACT(pTHX_ RExC_state_t *pRExC_state, if (! len_passed_in) { if (UTF) { - if (UNI_IS_INVARIANT(code_point)) { + if (UVCHR_IS_INVARIANT(code_point)) { if (LOC || ! FOLD) { /* /l defers folding until runtime */ *character = (U8) code_point; } @@ -12473,7 +12473,7 @@ tryagain: * the simple case just below.) */ UV folded; - if (isASCII(ender)) { + if (isASCII_uni(ender)) { folded = toFOLD(ender); *(s)++ = (U8) folded; } diff --git a/t/charset_tools.pl b/t/charset_tools.pl index 6abf902..6d70a37 100644 --- a/t/charset_tools.pl +++ b/t/charset_tools.pl @@ -29,7 +29,7 @@ sub latin1_to_native($) { return $string if $::IS_ASCII; my $output = ""; for my $i (0 .. 
length($string) - 1) { - $output .= chr(ord_latin1_to_native(ord(substr($string, $i, 1)))); + $output .= chr(utf8::unicode_to_native(ord(substr($string, $i, 1)))); } # Preserve utf8ness of input onto the output, even if it didn't need to be # utf8 diff --git a/t/lib/warnings/toke b/t/lib/warnings/toke index 8c0158a..cf0d020 100644 --- a/t/lib/warnings/toke +++ b/t/lib/warnings/toke @@ -150,6 +150,12 @@ EXPECT Use of bare << to mean <<"" is deprecated at - line 2. ######## # toke.c +BEGIN { + if (ord('A') == 193) { + print "SKIPPED\n# Literal control characters in variable names forbidden on EBCDIC"; + exit 0; + } +} eval "\$\cT"; eval "\${\7LOBAL_PHASE}"; eval "\${\cT}"; @@ -1360,11 +1366,30 @@ EXPECT "\c`" is more clearly written simply as "\ " at - line 4. ######## # toke.c +BEGIN { + if (ord('A') == 193) { + print "SKIPPED\n# test is ASCII-specific"; + exit 0; + } +} +use warnings; +my $a = "\c{ack}"; +EXPECT +OPTION fatal +Use ";" instead of "\c{" at - line 9. +######## +# toke.c +BEGIN { + if (ord('A') == 65) { + print "SKIPPED\n# test is EBCDIC-specific"; + exit 0; + } +} use warnings; my $a = "\c{ack}"; EXPECT OPTION fatal -Use ";" instead of "\c{" at - line 3. +Sequence "\c{" invalid at - line 9. 
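[Editor's sketch: the ASCII-platform warning above suggests ";" for "\c{" because \cX complements bit 0x40 of the (upcased) character, and '{' (0x7B) xor 0x40 is ';' (0x3B), a plain graphic rather than a control. A hypothetical C helper showing the mapping; on EBCDIC no such simple xor exists, which is why the sequence is instead rejected outright there.]

```c
/* The \cX mapping on ASCII platforms: perl upcases a lowercase X
 * first, then flips bit 0x40.  '?' is the documented outlier,
 * yielding DEL (0x7F). */
char ctrl_of(char c)
{
    if (c >= 'a' && c <= 'z')
        c = (char)(c - ('a' - 'A'));  /* \ca is the same as \cA */
    return (char)(c ^ 0x40);
}
```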
######## # toke.c my $a = "\câ"; @@ -1472,6 +1497,12 @@ I ######## # toke.c #[perl #119123] disallow literal control character variables +BEGIN { + if (ord('A') == 193) { + print "SKIPPED\n# Literal control characters in variable names forbidden on EBCDIC"; + exit 0; + } +} eval "\$\cQ = 25"; eval "\${ \cX } = 24"; *{ diff --git a/t/lib/warnings/utf8 b/t/lib/warnings/utf8 index 9004731..614d5ec 100644 --- a/t/lib/warnings/utf8 +++ b/t/lib/warnings/utf8 @@ -80,16 +80,26 @@ Operation "uc" returns its argument for non-Unicode code point 0x110000 at - lin use warnings 'utf8'; my $d800 = uc(chr(0xD800)); my $nonUnicode = uc(chr(0x110000)); -my $big_nonUnicode = uc(chr(0x8000_0000)); no warnings 'non_unicode'; my $d800 = uc(chr(0xD800)); my $nonUnicode = uc(chr(0x110000)); -my $big_nonUnicode = uc(chr(0x8000_0000)); EXPECT Operation "uc" returns its argument for UTF-16 surrogate U+D800 at - line 2. Operation "uc" returns its argument for non-Unicode code point 0x110000 at - line 3. -Operation "uc" returns its argument for non-Unicode code point 0x80000000 at - line 4. -Operation "uc" returns its argument for UTF-16 surrogate U+D800 at - line 6. +Operation "uc" returns its argument for UTF-16 surrogate U+D800 at - line 5. +######## +BEGIN { + if (ord('A') == 193) { + print "SKIPPED\n# ebcdic platforms can't handle this large a code point"; + exit 0; + } +} +use warnings 'utf8'; +my $big_nonUnicode = uc(chr(0x8000_0000)); +no warnings 'non_unicode'; +my $big_nonUnicode = uc(chr(0x8000_0000)); +EXPECT +Operation "uc" returns its argument for non-Unicode code point 0x80000000 at - line 8. 
######## use warnings 'utf8'; my $d7ff = lc pack("U", 0xD7FF); diff --git a/t/loc_tools.pl b/t/loc_tools.pl index fccbeeb..5406368 100644 --- a/t/loc_tools.pl +++ b/t/loc_tools.pl @@ -209,6 +209,10 @@ sub find_locales ($) { # Returns an array of all the locales we found on the sub is_locale_utf8 ($) { # Return a boolean as to if core Perl thinks the input # is a UTF-8 locale + + # On z/OS, even locales marked as UTF-8 aren't. + return 0 if ord "A" != 65; + my $locale = shift; use locale; diff --git a/t/re/regexp.t b/t/re/regexp.t index 7e104db..8c51ea5 100644 --- a/t/re/regexp.t +++ b/t/re/regexp.t @@ -3,6 +3,7 @@ # The tests are in a separate file 't/re/re_tests'. # Each line in that file is a separate test. # There are five columns, separated by tabs. +# An optional sixth column is used to give a reason, only when skipping tests # # Column 1 contains the pattern, optionally enclosed in C<''>. # Modifiers can be put after the closing C<'>. @@ -20,6 +21,8 @@ # t test exposes a bug with threading, TODO if qr_embed_thr # s test should only be run for regex_sets_compat.t # S test should not be run for regex_sets_compat.t +# a test should only be run on ASCII platforms +# e test should only be run on EBCDIC platforms # # Columns 4 and 5 are used only if column 3 contains C<y> or C<c>. # @@ -47,6 +50,9 @@ # # Note that columns 2,3 and 5 are all enclosed in double quotes and then # evalled; so something like a\"\x{100}$1 has length 3+length($1). +# +# \x... and \o{...} constants are automatically converted to the native +# character set if necessary. \[0-7] constants aren't my ($file, $iters); BEGIN { @@ -71,6 +77,24 @@ sub _comment { map { split /\n/ } @_; } +sub convert_from_ascii { + my $string = shift; + + #my $save = $string; + # Convert \x{...}, \o{...} + $string =~ s/ (?<! \\ ) \\x\{ ( .*? ) } / "\\x{" . sprintf("%X", utf8::unicode_to_native(hex $1)) . "}" /gex; + $string =~ s/ (?<! \\ ) \\o\{ ( .*? ) } / "\\o{" . sprintf("%o", utf8::unicode_to_native(oct $1)) . 
"}" /gex; + + # Convert \xAB + $string =~ s/ (?<! \\ ) \\x ( [A-Fa-f0-9]{2} ) / "\\x" . sprintf("%02X", utf8::unicode_to_native(hex $1)) /gex; + + # Convert \xA + $string =~ s/ (?<! \\ ) \\x ( [A-Fa-f0-9] ) (?! [A-Fa-f0-9] ) / "\\x" . sprintf("%X", utf8::unicode_to_native(hex $1)) /gex; + + #print STDERR __LINE__, ": $save\n$string\n" if $save ne $string; + return $string; +} + use strict; use warnings FATAL=>"all"; use vars qw($bang $ffff $nulnul); # used by the tests @@ -113,13 +137,20 @@ foreach (@tests) { } $reason = '' unless defined $reason; my $input = join(':',$pat,$subject,$result,$repl,$expect); + # the double '' below keeps simple syntax highlighters from going crazy $pat = "'$pat'" unless $pat =~ /^[:''\/]/; $pat =~ s/(\$\{\w+\})/$1/eeg; $pat =~ s/\\n/\n/g unless $regex_sets; + $pat = convert_from_ascii($pat) if ord("A") != 65; + + $subject = convert_from_ascii($subject) if ord("A") != 65; $subject = eval qq("$subject"); die $@ if $@; + + $expect = convert_from_ascii($expect) if ord("A") != 65; $expect = eval qq("$expect"); die $@ if $@; $expect = $repl = '-' if $skip_amp and $input =~ /\$[&\`\']/; + my $todo_qr = $qr_embed_thr && ($result =~ s/t//); my $skip = ($skip_amp ? ($result =~ s/B//i) : ($result =~ s/B//)); ++$skip if $result =~ s/M// && !defined &DynaLoader::boot_DynaLoader; @@ -129,6 +160,14 @@ foreach (@tests) { $reason = "Test not valid for $0"; } } + if ($result =~ s/a// && ord("A") != 65) { + $skip++; + $reason = "Test is only valid for ASCII platforms. $reason"; + } + if ($result =~ s/e// && ord("A") != 193) { + $skip++; + $reason = "Test is only valid for EBCDIC platforms. $reason"; + } $reason = 'skipping $&' if $reason eq '' && $skip_amp; $result =~ s/B//i unless $skip; my $todo= $result =~ s/T// ? 
" # TODO" : ""; diff --git a/t/uni/variables.t b/t/uni/variables.t index e441000..e8259e5 100644 --- a/t/uni/variables.t +++ b/t/uni/variables.t @@ -14,7 +14,7 @@ use utf8; use open qw( :utf8 :std ); no warnings qw(misc reserved); -plan (tests => 65880); +plan (tests => 66900); # ${single:colon} should not be valid syntax { @@ -65,72 +65,200 @@ for my $v (qw( ^V ; < > ( ) {^GLOBAL_PHASE} ^W _ 1 4 0 [ ] ! @ / \ = )) { local $@; eval "use utf8; \$$v;"; - is $@, '', "No syntax error for \$$v under use utf8"; + is $@, '', "No syntax error for \$$v under 'use utf8'"; } } # Checking if the Latin-1 range behaves as expected, and that the behavior is the # same whenever under strict or not. -for ( 0x80..0xff ) { +for ( 0x0 .. 0xff ) { + my @warnings; + local $SIG {__WARN__} = sub {push @warnings, @_ }; + my $ord = utf8::unicode_to_native($_); + my $chr = chr $ord; + my $syntax_error = 0; # Do we expect this code point to generate a + # syntax error? Assume not, for now + my $deprecated = 0; + my $name; + + # A different number of tests are run depending on the branches in this + # loop iteration. This allows us to add skips to make the reported total + # the same for each iteration. + my $tests = 0; + my $max_tests = 6; + + if ($chr =~ /[[:graph:]]/a) { + $name = "'$chr'"; + $syntax_error = 1 if $chr eq '{'; + } + elsif ($chr =~ /[[:space:]]/a) { + $name = sprintf "\\x%02x, an ASCII space character", $ord; + $syntax_error = 1; + } + elsif ($chr =~ /[[:cntrl:]]/a) { + if ($chr eq "\N{NULL}") { + $name = sprintf "\\x%02x, NUL", $ord; + $syntax_error = 1; + } + else { + $name = sprintf "\\x%02x, an ASCII control", $ord; + $syntax_error = $::IS_EBCDIC; + $deprecated = ! $syntax_error; + } + } + elsif ($chr =~ /\pC/) { + if ($chr eq "\N{SHY}") { + $name = sprintf "\\x%02x, SHY", $ord; + } + else { + $name = sprintf "\\x%02x, a C1 control", $ord; + } + $syntax_error = $::IS_EBCDIC; + $deprecated = ! 
$syntax_error; + } + elsif ($chr =~ /\p{XIDStart}/) { + $name = sprintf "\\x%02x, a non-ASCII XIDS character", $ord; + } + elsif ($chr =~ /\p{XPosixSpace}/) { + $name = sprintf "\\x%02x, a non-ASCII space character", $ord; + $syntax_error = $::IS_EBCDIC; + $deprecated = ! $syntax_error; + } + else { + $name = sprintf "\\x%02x, a non-ASCII, non-XIDS graphic character", $ord; + } no warnings 'closure'; - my $chr = chr; - my $esc = sprintf("%X", ord $chr); + my $esc = sprintf("%X", $ord); utf8::downgrade($chr); if ($chr !~ /\p{XIDS}/u) { - is evalbytes "no strict; \$$chr = 10", - 10, - sprintf("\\x%02x, part of the latin-1 range, is legal as a length-1 variable", $_); - - utf8::upgrade($chr); - local $@; - eval "no strict; use utf8; \$$chr = 1"; - like $@, - qr/\QUnrecognized character \x{\E\L$esc/, - sprintf("..but is illegal as a length-1 variable under use utf8", $_); + if ($syntax_error) { + evalbytes "\$$chr"; + like($@, qr/ syntax\ error | Unrecognized\ character /x, + "$name as a length-1 variable generates a syntax error"); + $tests++; + } + elsif ($ord < 32 || chr =~ /[[:punct:][:digit:]]/a) { + + # Unlike other variables, we dare not try setting the length-1 + # variables that are \cX (for all valid X) nor ASCII ones that are + # punctuation nor digits. This is because many of these variables + # have meaning to the system, and setting them could have side + # effects or not work as expected (And using fresh_perl() doesn't + # always help.) For example, setting $^D (to use a visible + # representation of code point 0x04) turns on tracing, and setting + # $^E sets an error number, but what gets printed is instead a + # string associated with that number. For all these we just + # verify that they don't generate a syntax error. + local $@; + evalbytes "\$$chr;"; + is $@, '', "$name as a length-1 variable doesn't generate a syntax error"; + $tests++; + utf8::upgrade($chr); + evalbytes "no strict; use utf8; \$$chr;", + is $@, '', " ... 
and the same under 'use utf8'"; + $tests++; + } + else { + is evalbytes "no strict; \$$chr = 10", + 10, + "$name is legal as a length-1 variable"; + $tests++; + if ($chr =~ /[[:ascii:]]/) { + utf8::upgrade($chr); + is evalbytes "no strict; use utf8; \$$chr = 1", + 1, + " ... and is legal under 'use utf8'"; + $tests++; + } + else { + utf8::upgrade($chr); + local $@; + eval "no strict; use utf8; \$$chr = 1"; + like $@, + qr/\QUnrecognized character \x{\E\L$esc/, + " ... but is illegal as a length-1 variable under 'use utf8'"; + $tests++; + } + } } else { { no utf8; local $@; evalbytes "no strict; \$$chr = 1"; - is($@, '', sprintf("\\x%02x, =~ \\p{XIDS}, latin-1, no utf8, no strict, is a valid length-1 variable", $_)); - - local $@; - evalbytes "use strict; \$$chr = 1"; - is($@, - '', - sprintf("\\x%02x under no utf8 does not have to be required under strict, even though it matches XIDS", $_) - ); - - local $@; - evalbytes "\$a$chr = 1"; - like($@, - qr/Unrecognized character /, - sprintf("...but under no utf8, it's not allowed in two-or-more character variables") - ); - - local $@; - evalbytes "\$a$chr = 1"; - like($@, - qr/Unrecognized character /, - sprintf("...but under no utf8, it's not allowed in two-or-more character variables") - ); + is($@, '', "$name under 'no utf8', 'no strict', is a valid length-1 variable"); + $tests++; + + if ($chr !~ /[[:ascii:]]/) { + local $@; + evalbytes "use strict; \$$chr = 1"; + is($@, + '', + " ... and under 'no utf8' does not have to be required under strict, even though it matches XIDS" + ); + $tests++; + + local $@; + evalbytes "\$a$chr = 1"; + like($@, + qr/Unrecognized character /, + " ... 
but under 'no utf8', it's not allowed in length-2+ variables" + ); + $tests++; + } } { use utf8; - my $u = $chr; - utf8::upgrade($u); + my $utf8 = $chr; + utf8::upgrade($utf8); local $@; - eval "no strict; \$$u = 1"; - is($@, '', sprintf("\\x%02x, =~ \\p{XIDS}, UTF-8, use utf8, no strict, is a valid length-1 variable", $_)); + eval "no strict; \$$utf8 = 1"; + is($@, '', " ... and under 'use utf8', 'no strict', is a valid length-1 variable"); + $tests++; local $@; - eval "use strict; \$$u = 1"; - like($@, - qr/Global symbol "\$$u" requires explicit package name/, - sprintf("\\x%02x under utf8 has to be required under strict", $_) - ); + eval "use strict; \$$utf8 = 1"; + if ($chr =~ /[ab]/) { # These are special, for sort() + is($@, '', " ... and under 'use utf8', 'use strict'," + . " is a valid length-1 variable (\$a and \$b are special)"); + $tests++; + } + else { + like($@, + qr/Global symbol "\$$utf8" requires explicit package name/, + " ... and under utf8 has to be required under strict" + ); + $tests++; + } + } + } + + if (! $deprecated) { + if ($chr =~ /[#*]/) { + + # Length-1 variables with these two characters used to be used by + # Perl, but now their generates a warning that they're gone. + # Ignore such warnings. + for (my $i = @warnings - 1; $i >= 0; $i--) { + splice @warnings, $i, 1 if $warnings[$i] =~ /is no longer supported/; + } } + ok(@warnings == 0, " ... and doesn't generate any warnings"); + $tests++; + } + elsif (! @warnings) { + fail(" ... and generates deprecation warnings (since is deprecated)"); + $tests++; + } + else { + ok((scalar @warnings == grep { $_ =~ /deprecated/ } @warnings), + " ... 
and generates deprecation warnings (only)");
+            $tests++;
+        }
+
+        SKIP: {
+            die "Wrong max count for tests" if $tests > $max_tests;
+            skip("untaken tests", $max_tests - $tests) if $max_tests > $tests;
+        }
     }
 }
@@ -138,7 +266,7 @@ for ( 0x80..0xff ) {
     use utf8;
     my $ret = eval "my \$c\x{327} = 100; \$c\x{327}";   # c + cedilla
     is($@, '', "ASCII character + combining character works as a variable name");
-    is($ret, 100, "...and returns the correct value");
+    is($ret, 100, " ... and returns the correct value");
 }
 
 # From Tom Christiansen's 'highly illegal variable names are now accidentally legal' mail
@@ -226,12 +354,16 @@ EOP
         no warnings 'deprecated';
         for my $var ( '$', "\7LOBAL_PHASE", "^GLOBAL_PHASE", "^V" ) {
+          SKIP: {
+            skip("Literal control characters in variable names forbidden on EBCDIC", 3)
+                                if ($::IS_EBCDIC && ord substr($var, 0, 1) < 32);
             eval "\${ $var}";
             is($@, '', "\${ $var} works" );
             eval "\${$var }";
             is($@, '', "\${$var } works" );
             eval "\${ $var }";
             is($@, '', "\${ $var } works" );
+          }
         }
     }
 }
@@ -244,19 +376,30 @@ EOP
     );
-    is(
-        "".eval "*{^JOIN}",
-        "*main::\nOIN",
-        "...but \$^J is still legal"
-    );
+    SKIP: {
+        skip('Is $^U on EBCDIC 1047, BC; nothing works on 0037', 1)
+                                                            if $::IS_EBCDIC;
+        is(
+            "".eval "*{^JOIN}",
+            "*main::\nOIN",
+            " ... but \$^J is still legal"
+        );
+    }
 
+  SKIP: {
+    skip("Literal control characters in variable names forbidden on EBCDIC", 2)
+                                                            if $::IS_EBCDIC;
     no warnings 'deprecated';
     my $ret = eval "\${\cT\n}";
     is($@, "", 'No errors from using ${\n\cT\n}');
-    is($ret, $^T, "...and we got the right value");
+    is($ret, $^T, " ... and we got the right value");
+  }
 }
 
-{
+SKIP: {
+    skip("Literal control characters in variable names forbidden on EBCDIC", 5)
+                                                            if $::IS_EBCDIC;
+
     # Originally from t/base/lex.t, moved here since we can't
     # turn deprecation warnings off in that file.
     no strict;
@@ -273,9 +416,9 @@ EOP
     );
     eval "\$\cQ = 24";   # Literal control character
-    is($@, "", "...and they can be assigned to without error");
-    is(${"\cQ"}, 24, "...and the assignment works");
-    is($^Q, 24, "...even if we access the variable through the caret name");
+    is($@, "", " ... and they can be assigned to without error");
+    is(${"\cQ"}, 24, " ... and the assignment works");
+    is($^Q, 24, " ... even if we access the variable through the caret name");
     is(\${"\cQ"}, \$^Q, '\${\cQ} == \$^Q');
 }
diff --git a/toke.c b/toke.c
index 44d0fef..b653687 100644
--- a/toke.c
+++ b/toke.c
@@ -8550,25 +8550,52 @@ S_scan_ident(pTHX_ char *s, char *dest, STRLEN destlen, I32 ck_uni)
 
 /* Is the byte 'd' a legal single character identifier name?  'u' is true
  * iff Unicode semantics are to be used.  The legal ones are any of:
- *  a) ASCII digits
- *  b) ASCII punctuation
+ *  a) all ASCII characters except:
+ *          1) space-type ones, like \t and SPACE;
+ *          2) NUL;
+ *          3) '{'
+ *     The final case currently doesn't get this far in the program, so we
+ *     don't test for it.  If that were to change, it would be ok to allow it.
  *  c) When not under Unicode rules, any upper Latin1 character
- *  d) \c?, \c\, \c^, \c_, and \cA..\cZ, minus the ones that have traditionally
- *     been matched by \s on ASCII platforms.  That is: \c?, plus 1-32, minus
- *     the \s ones. */
-#define VALID_LEN_ONE_IDENT(d, u) (isPUNCT_A((U8)(d))                       \
-                                   || isDIGIT_A((U8)(d))                    \
-                                   || (!(u) && !isASCII((U8)(d)))           \
-                                   || ((((U8)(d)) < 32)                     \
-                                       && (((((U8)(d)) >= 14)               \
-                                            || (((U8)(d)) <= 8 && (d) != 0) \
-                                            || (((U8)(d)) == 13))))         \
-                                   || (((U8)(d)) == toCTRL('?')))
-    if (s < PL_bufend
-        && (isIDFIRST_lazy_if(s, is_utf8) || VALID_LEN_ONE_IDENT(*s, is_utf8)))
+ *  d) Otherwise, when unicode rules are used, all XIDS characters.
+ *
+ * Because all ASCII characters have the same representation whether
+ * encoded in UTF-8 or not, we can use the foo_A macros below and '\0' and
+ * '{' without knowing if is UTF-8 or not */
+#ifdef EBCDIC
+#   define VALID_LEN_ONE_IDENT(s, is_utf8)                                  \
+    (isGRAPH_A(*(s)) || ((is_utf8)                                          \
+                         ? isIDFIRST_utf8((U8*) (s))                        \
+                         : (isGRAPH_L1(*s)                                  \
+                            && LIKELY((U8) *(s) != LATIN1_TO_NATIVE(0xAD)))))
+#else
+#   define VALID_LEN_ONE_IDENT(s, is_utf8) (! isSPACE_A(*(s))               \
+                                            && LIKELY(*(s) != '\0')         \
+                                            && (! is_utf8                   \
+                                                || isASCII_utf8((U8*) (s))  \
+                                                || isIDFIRST_utf8((U8*) (s))))
+#endif
+    if ((s <= PL_bufend - (is_utf8)
+                          ? UTF8SKIP(s)
+                          : 1)
+        && VALID_LEN_ONE_IDENT(s, is_utf8))
     {
-        if ( isCNTRL_A((U8)*s) ) {
-            deprecate("literal control characters in variable names");
+        /* Deprecate all non-graphic characters.  Include SHY as a non-graphic,
+         * because often it has no graphic representation.  (We can't get to
+         * here with SHY when 'is_utf8' is true, so no need to include a UTF-8
+         * test for it.) */
+        if ((is_utf8)
+            ? ! isGRAPH_utf8( (U8*) s)
+            : (! isGRAPH_L1( (U8) *s)
+               || UNLIKELY((U8) *(s) == LATIN1_TO_NATIVE(0xAD))))
+        {
+            /* Split messages for back compat */
+            if (isCNTRL_A( (U8) *s)) {
+                deprecate("literal control characters in variable names");
+            }
+            else {
+                deprecate("literal non-graphic characters in variable names");
+            }
         }
 
         if (is_utf8) {
diff --git a/utf8.c b/utf8.c
index a7baed4..f42b1a2 100644
--- a/utf8.c
+++ b/utf8.c
@@ -107,6 +107,11 @@ Perl_uvoffuni_to_utf8_flags(pTHX_ U8 *d, UV uv, UV flags)
 	return d;
     }
 
+#ifdef EBCDIC
+    /* Not representable in UTF-EBCDIC */
+    flags |= UNICODE_DISALLOW_FE_FF;
+#endif
+
     /* The first problematic code point is the first surrogate */
     if (uv >= UNICODE_SURROGATE_FIRST
 	&& ckWARN3_d(WARN_SURROGATE, WARN_NON_UNICODE, WARN_NONCHAR))
@@ -130,6 +135,10 @@ Perl_uvoffuni_to_utf8_flags(pTHX_ U8 *d, UV uv, UV flags)
 	    if (flags & UNICODE_DISALLOW_SUPER
 		|| (UNICODE_IS_FE_FF(uv) && (flags & UNICODE_DISALLOW_FE_FF)))
 	    {
+#ifdef EBCDIC
+		Perl_die(aTHX_ "Can't represent character for Ox%"UVXf" on this platform", uv);
+		assert(0);
+#endif
 		return NULL;
 	    }
 	}
@@ -1688,7 +1697,7 @@ Perl__is_utf8_perl_idstart(pTHX_ const U8 *p)
     if (! PL_utf8_perl_idstart) {
 	invlist = _new_invlist_C_array(_Perl_IDStart_invlist);
     }
-    return is_utf8_common(p, &PL_utf8_perl_idstart, "", invlist);
+    return is_utf8_common(p, &PL_utf8_perl_idstart, "_Perl_IDStart", invlist);
 }
 
 bool
@@ -1711,7 +1720,7 @@ Perl__is_utf8_perl_idcont(pTHX_ const U8 *p)
     if (! PL_utf8_perl_idcont) {
 	invlist = _new_invlist_C_array(_Perl_IDCont_invlist);
     }
-    return is_utf8_common(p, &PL_utf8_perl_idcont, "", invlist);
+    return is_utf8_common(p, &PL_utf8_perl_idcont, "_Perl_IDCont", invlist);
 }
 
 bool
diff --git a/utf8.h b/utf8.h
index d3b55ee..3d29706 100644
--- a/utf8.h
+++ b/utf8.h
@@ -382,11 +382,11 @@ Perl's extended UTF-8 means we can have start bytes up to FF.
 #define UTF8_TWO_BYTE_HI(c)                                                 \
    (__ASSERT_((sizeof(c) == 1)                                              \
               || !(((WIDEST_UTYPE)(c)) & ~MAX_PORTABLE_UTF8_TWO_BYTE))      \
-    ((U8) __BASE_TWO_BYTE_HI(c, NATIVE_TO_LATIN1)))
+    ((U8) __BASE_TWO_BYTE_HI(c, NATIVE_TO_UNI)))
 #define UTF8_TWO_BYTE_LO(c)                                                 \
    (__ASSERT_((sizeof(c) == 1)                                              \
               || !(((WIDEST_UTYPE)(c)) & ~MAX_PORTABLE_UTF8_TWO_BYTE))      \
-    ((U8) __BASE_TWO_BYTE_LO(c, NATIVE_TO_LATIN1)))
+    ((U8) __BASE_TWO_BYTE_LO(c, NATIVE_TO_UNI)))
 
 /* This is illegal in any well-formed UTF-8 in both EBCDIC and ASCII
  * as it is only in overlongs. */
-- 
Perl5 Master Repository
