In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/eb9df7077460d67ecf8fe825ff5613ec5c34cb6e?hp=28ffebafd3403d496952bd64c99bb9bd7cbe871f>

- Log -----------------------------------------------------------------
commit eb9df7077460d67ecf8fe825ff5613ec5c34cb6e
Author: Karl Williamson <[email protected]>
Date: Wed Jan 14 13:45:40 2015 -0700

    Add text about EBCDIC to pods: perlhack* perlport
-----------------------------------------------------------------------

Summary of changes:
 pod/perlhack.pod     | 54 +++++++++++++++++++++++++++++++++++++++++++++++++---
 pod/perlhacktips.pod | 34 +++++++++++++++++++++++----------
 pod/perlport.pod     | 21 +++++++++++++-------
 3 files changed, 89 insertions(+), 20 deletions(-)

diff --git a/pod/perlhack.pod b/pod/perlhack.pod
index b966542..23620a3 100644
--- a/pod/perlhack.pod
+++ b/pod/perlhack.pod
@@ -38,7 +38,10 @@ latest version directly from the perl source:
 
 =item * Make your change
 
-Hack, hack, hack.
+Hack, hack, hack. Keep in mind that Perl runs on many different
+platforms, with different operating systems that have different
+capabilities, different filesystem organizations, and even different
+character sets. L<perlhacktips> gives advice on this.
 
 =item * Test your change
 
@@ -774,8 +777,53 @@ contains the test. This causes some problems with the tests in
 F<lib/>, so here's some opportunity for some patching.
 
 You must be triply conscious of cross-platform concerns. This usually
-boils down to using L<File::Spec> and avoiding things like C<fork()>
-and C<system()> unless absolutely necessary.
+boils down to using L<File::Spec>, avoiding things like C<fork()>
+and C<system()> unless absolutely necessary, and not assuming that a
+given character has a particular ordinal value (code point) or that its
+UTF-8 representation is composed of particular bytes.
+
+There are several functions available to specify characters and code
+points portably in tests. The always-preloaded functions
+C<utf8::unicode_to_native()> and its inverse
+C<utf8::native_to_unicode()> take code points and translate
+appropriately. The file F<t/charset_tools.pl> has several functions
+that can be useful. It has versions of the previous two functions
+that take strings as inputs -- not single numeric code points:
+C<uni_to_native()> and C<native_to_uni()>. If you must look at the
+individual bytes comprising a UTF-8 encoded string,
+C<byte_utf8a_to_utf8n()> takes as input a string of those bytes encoded
+for an ASCII platform, and returns the equivalent string in the native
+platform. For example, C<byte_utf8a_to_utf8n("\xC2\xA0")> returns the
+byte sequence on the current platform that form the UTF-8 for C<U+00A0>,
+since C<"\xC2\xA0"> are the UTF-8 bytes on an ASCII platform for that
+code point. This function returns C<"\xC2\xA0"> on an ASCII platform, and
+C<"\x80\x41"> on an EBCDIC 1047 one.
+
+But easiest is to use C<\N{}> to specify characters, if the side effects
+aren't troublesome. Simply specify all your characters in hex, using
+C<\N{U+ZZ}> instead of C<\xZZ>. C<\N{}> is the Unicode name, and so it
+always gives you the Unicode character. C<\N{U+41}> is the character
+whose Unicode code point is C<0x41>, hence is C<'A'> on all platforms.
+The side effects are:
+
+=over 4
+
+=item 1)
+
+These select Unicode rules. That means that in double-quotish strings,
+the string is always converted to UTF-8 to force a Unicode
+interpretation (you can C<utf8::downgrade()> afterwards to convert back
+to non-UTF8, if possible). In regular expression patterns, the
+conversion isn't done, but if the character set modifier would
+otherwise be C</d>, it is changed to C</u>.
+
+=item 2)
+
+If you use the form C<\N{I<character name>}>, the L<charnames> module
+gets automatically loaded. This may not be suitable for the test level
+you are doing.
+
+=back
 
 =head2 Special C<make test> targets
 
diff --git a/pod/perlhacktips.pod b/pod/perlhacktips.pod
index 40bb3a1..943bdfb 100644
--- a/pod/perlhacktips.pod
+++ b/pod/perlhacktips.pod
@@ -143,7 +143,7 @@ as many as possible of the C<-std=c89>, C<-ansi>, C<-pedantic>, and a
 selection of C<-W> flags (see cflags.SH).
 
 Also study L<perlport> carefully to avoid any bad assumptions about the
-operating system, filesystems, and so forth.
+operating system, filesystems, character set, and so forth.
 
 You may once in a while try a "make microperl" to see whether we can
 still compile Perl with just the bare minimum of interfaces. (See
@@ -275,12 +275,26 @@ Assuming the character set is ASCIIish
 Perl can compile and run under EBCDIC platforms. See L<perlebcdic>.
 This is transparent for the most part, but because the character sets
 differ, you shouldn't use numeric (decimal, octal, nor hex) constants
-to refer to characters. You can safely say 'A', but not 0x41. You can
-safely say '\n', but not \012. If a character doesn't have a trivial
-input form, you should add it to the list in
-F<regen/unicode_constants.pl>, and have Perl create #defines for you,
+to refer to characters. You can safely say C<'A'>, but not C<0x41>.
+You can safely say C<'\n'>, but not C<\012>. However, you can use
+macros defined in F<utf8.h> to specify any code point portably.
+C<LATIN1_TO_NATIVE(0xDF)> is going to be the code point that means
+LATIN SMALL LETTER SHARP S on whatever platform you are running on (on
+ASCII platforms it compiles without adding any extra code, so there is
+zero performance hit on those). The acceptable inputs to
+C<LATIN1_TO_NATIVE> are from C<0x00> through C<0xFF>. If your input
+isn't guaranteed to be in that range, use C<UNICODE_TO_NATIVE> instead.
+C<NATIVE_TO_LATIN1> and C<NATIVE_TO_UNICODE> translate the opposite
+direction.
+
+If you need the string representation of a character that doesn't have a
+mnemonic name in C, you should add it to the list in
+F<regen/unicode_constants.pl>, and have Perl create C<#define>s for you,
 based on the current platform.
 
+Note that the C<isI<FOO>> and C<toI<FOO>> macros in F<handy.h> work
+properly on native code points and strings.
+
 Also, the range 'A' - 'Z' in ASCII is an unbroken sequence of 26 upper
 case alphabetic characters. That is not true in EBCDIC. Nor for 'a' to
 'z'. But '0' - '9' is an unbroken range in both systems. Don't assume
@@ -293,11 +307,11 @@ able to handle EBCDIC without having to change pre-existing code.
 
 UTF-8 and UTF-EBCDIC are two different encodings used to represent
 Unicode code points as sequences of bytes. Macros with the same names
-(but different definitions) in C<utf8.h> and C<utfebcdic.h> are used to
+(but different definitions) in F<utf8.h> and F<utfebcdic.h> are used to
 allow the calling code to think that there is only one such encoding.
 This is almost always referred to as C<utf8>, but it means the EBCDIC
 version as well. Again, comments in the code may well be wrong even if
-the code itself is right. For example, the concept of C<invariant
+the code itself is right. For example, the concept of UTF-8 C<invariant
 characters> differs between ASCII and EBCDIC. On ASCII platforms, only
 characters that do not have the high-order bit set (i.e. whose ordinals
 are strict ASCII, 0 - 127) are invariant, and the documentation and
@@ -314,9 +328,9 @@ Assuming the character set is just ASCII
 ASCII is a 7 bit encoding, but bytes have 8 bits in them. The 128 extra
 characters have different meanings depending on the locale. Absent a
 locale, currently these extra characters are generally considered to be
-unassigned, and this has presented some problems. This is being changed
-starting in 5.12 so that these characters will be considered to be
-Latin-1 (ISO-8859-1).
+unassigned, and this has presented some problems. This has being
+changed starting in 5.12 so that these characters can be considered to
+be Latin-1 (ISO-8859-1).
 
 =item *
 
diff --git a/pod/perlport.pod b/pod/perlport.pod
index a58ab15..3bb10e3 100644
--- a/pod/perlport.pod
+++ b/pod/perlport.pod
@@ -88,7 +88,8 @@ and S<Mac OS> uses C<\015>.
 
 Perl uses C<\n> to represent the "logical" newline, where what is
 logical may depend on the platform in use. In MacPerl, C<\n> always
-means C<\015>. In DOSish perls, C<\n> usually means C<\012>, but when
+means C<\015>. On EBCDIC platforms, C<\n> could be C<\025> or C<\045>.
+In DOSish perls, C<\n> usually means C<\012>, but when
 accessing a file in "text" mode, perl uses the C<:crlf> layer that
 translates it to (or from) C<\015\012>, depending on whether you're
 reading or writing. Unix does the same thing on ttys in canonical
@@ -646,11 +647,16 @@ value to get what should be the proper value on any system.
 Assume very little about character sets.
 
 Assume nothing about numerical values (C<ord>, C<chr>) of characters.
-Do not use explicit code point ranges (like \xHH-\xHH); use for
-example symbolic character classes like C<[:print:]>.
+Do not use explicit code point ranges (like C<\xHH-\xHH)>. However,
+starting in Perl v5.22, regular expression pattern bracketed character
+class ranges specified like C<qr/[\N{U+HH}-\N{U+HH}]/> are portable.
+You can portably use, for example, symbolic character classes like
+C<[:print:]>.
 
 Do not assume that the alphabetic characters are encoded contiguously
-(in the numeric sense). There may be gaps.
+(in the numeric sense). There may be gaps. Special coding in Perl,
+however, guarantees that all subsets of C<qr/[A-Z]/>, C<qr/[a-z]/>, and
+C<qr/[0-9]/> behave as expected.
 
 Do not assume anything about the ordering of the characters.
 The lowercase letters may come before or after the uppercase letters;
@@ -677,10 +683,11 @@ code under a UTF-8 locale, in which case random native bytes might be
 illegal ("Malformed UTF-8 ...") This means that for example embedding
 ISO 8859-1 bytes beyond 0x7f into your strings might cause trouble
 later. If the bytes are native 8-bit bytes, you can use the C<bytes>
-pragma. If the bytes are in a string (regular expression being a
-curious string), you can often also use the C<\xHH> notation instead
+pragma. If the bytes are in a string (regular expressions being
+curious strings), you can often also use the C<\xHH> or more portably,
+the C<\N{U+HH}> notations instead
 of embedding the bytes as-is. If you want to write your code in UTF-8,
-you can use the C<utf8>.
+you can use L<utf8>.
 
 =head2 System Resources
 
-- 
Perl5 Master Repository
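
The portable idioms that the new perlhack.pod and perlport.pod text describes can be tried outside the test suite. What follows is only a minimal sketch, not part of the commit: the variable names are made up for illustration, and it sticks to the always-preloaded C<utf8::> functions and C<\N{U+...}> because the F<t/charset_tools.pl> helpers exist only inside a perl source tree. The bracketed range at the end is documented as portable only from Perl v5.22 on.

```perl
use strict;
use warnings;

# Translate a Unicode code point to the platform's native code point and back.
# On ASCII platforms both calls return their argument unchanged.
my $native_A = utf8::unicode_to_native(0x41);    # U+0041, LATIN CAPITAL LETTER A
print "round trip ok\n" if utf8::native_to_unicode($native_A) == 0x41;

# chr() takes a native code point, so this should be 'A' on any platform.
print "chr ok\n" if chr($native_A) eq 'A';

# \N{U+41} names the character by its Unicode code point rather than by
# a native ordinal, as "\x41" would.
my $string = "\N{U+41}";
print "\\N{U+41} is 'A'\n" if $string eq 'A';

# Side effect 1 from the pod text: the literal is expected to be stored as
# UTF-8 internally. Print the flag instead of assuming its value.
print "UTF-8 flag before downgrade: ", (utf8::is_utf8($string) ? 1 : 0), "\n";

# utf8::downgrade() converts back to a non-UTF-8 string in place when possible.
utf8::downgrade($string);
print "UTF-8 flag after downgrade:  ", (utf8::is_utf8($string) ? 1 : 0), "\n";

# A bracketed character class range given with \N{U+...} endpoints
# (documented as portable starting in Perl v5.22).
print "range ok\n" if 'Q' =~ /[\N{U+41}-\N{U+5A}]/;
```

Using C<\N{U+41}> instead of C<"\x41"> trades a possible UTF-8 upgrade of the string for independence from the native character set; C<utf8::downgrade()> undoes the upgrade where the test cares about the internal representation.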
