In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/257844b92d3c10f7e307f63208d703bf01f713fb?hp=310a0d0a370ea21c004bfea4bbdd2cf15da94547>
- Log -----------------------------------------------------------------
commit 257844b92d3c10f7e307f63208d703bf01f713fb
Author: Karl Williamson <[email protected]>
Date: Sun Apr 26 09:37:39 2015 -0600

    perlhacktips: Add character set portability tip

M pod/perlhacktips.pod

commit c22aa07d4f20e1eb10134bbc6f87b22c36b55512
Author: Karl Williamson <[email protected]>
Date: Sun Apr 26 09:37:11 2015 -0600

    perlhacktips: Nit, clarification

M pod/perlhacktips.pod
-----------------------------------------------------------------------

Summary of changes:
 pod/perlhacktips.pod | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/pod/perlhacktips.pod b/pod/perlhacktips.pod
index 6d7a098..f7cd08d 100644
--- a/pod/perlhacktips.pod
+++ b/pod/perlhacktips.pod
@@ -289,7 +289,7 @@ direction.
 
 If you need the string representation of a character that doesn't have a
 mnemonic name in C, you should add it to the list in
-F<regen/unicode_constants.pl>, and have Perl create C<#define>s for you,
+F<regen/unicode_constants.pl>, and have Perl create C<#define>'s for you,
 based on the current platform.
 
 Note that the C<isI<FOO>> and C<toI<FOO>> macros in F<handy.h> work
@@ -298,7 +298,9 @@ properly on native code points and strings.
 Also, the range 'A' - 'Z' in ASCII is an unbroken sequence of 26 upper
 case alphabetic characters. That is not true in EBCDIC. Nor for 'a' to
 'z'. But '0' - '9' is an unbroken range in both systems. Don't assume
-anything about other ranges.
+anything about other ranges. (Note that special handling of ranges in
+regular expression patterns makes it appear to Perl
+code that the aforementioned ranges are all unbroken.)
 
 Many of the comments in the existing code ignore the possibility of
 EBCDIC, and may be wrong therefore, even if the code works. This is
@@ -321,6 +323,31 @@ EBCDIC machines, but as long as the code itself uses the
 C<NATIVE_IS_INVARIANT()> macro appropriately, it works, even if the
 comments are wrong.
 
+As noted in L<perlhack/TESTING>, when writing test scripts, the file
+F<t/charset_tools.pl> contains some helpful functions for writing tests
+valid on both ASCII and EBCDIC platforms. Sometimes, though, a test
+can't use a function and it's inconvenient to have different test
+versions depending on the platform. There are 20 code points that are
+the same in all 4 character sets currently recognized by Perl (the 3
+EBCDIC code pages plus ISO 8859-1 (ASCII/Latin1)). These can be used in
+such tests, though there is a small possibility that Perl will become
+available in yet another character set, breaking your test. All but one
+of these code points are C0 control characters. The most significant
+controls that are the same are C<\0>, C<\r>, and C<\N{VT}> (also
+specifiable as C<\cK>, C<\x0B>, C<\N{U+0B}>, or C<\013>). The single
+non-control is U+00B6 PILCROW SIGN. The controls that are the same have
+the same bit pattern in all 4 character sets, regardless of the UTF8ness
+of the string containing them. The bit pattern for U+B6 is the same in
+all 4 for non-UTF8 strings, but differs in each when its containing
+string is UTF-8 encoded. The only other code points that have some sort
+of sameness across all 4 character sets are the pair 0xDC and 0xFC.
+Together these represent upper- and lowercase LATIN LETTER U WITH
+DIAERESIS, but which is upper and which is lower may be reversed: 0xDC
+is the capital in Latin1 and 0xFC is the small letter, while 0xFC is the
+capital in EBCDIC and 0xDC is the small one. This factoid may be
+exploited in writing case insensitive tests that are the same across all
+4 character sets.
+
 =item *
 
 Assuming the character set is just ASCII

--
Perl5 Master Repository
