In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/0397beb0d12565d70e168bfea7376e2612a6748a?hp=8dad89f0f488f99f930bb2955b564838f54816fa>
- Log -----------------------------------------------------------------
commit 0397beb0d12565d70e168bfea7376e2612a6748a
Author: Tony Cook <[email protected]>
Date:   Mon Jul 24 11:05:40 2017 +1000

    (perl #131685) improve utf8::* function documentation

    Splits the little cheat sheet I posted as a comment into pieces
    and puts them closer to where they belong

    - better document why you'd want to use utf8::upgrade()

    - similarly for utf8::downgrade()

    - try hard to convince people not to use utf8::is_utf8()

    - no, utf8::is_utf8() isn't what you want instead of utf8::valid()

    - change some examples to use $x instead of the sort reserved $a

M	lib/utf8.pm

commit ee329aefb9c0bfcee0e6cc41dcd6eb8b03206f30
Author: Tony Cook <[email protected]>
Date:   Wed Jul 19 15:42:18 2017 +1000

    unfortunately sysread() tries to read characters

M	pod/perluniintro.pod

commit 01c3fbbc0d1b54bb0dd6fdc0abed7854e62c6717
Author: Tony Cook <[email protected]>
Date:   Wed Jul 19 10:45:33 2017 +1000

    encoding.pm no longer works

M	pod/perlunicode.pod

commit e423fa83496ce7d83b137bd7f0852864b6073b36
Author: Tony Cook <[email protected]>
Date:   Wed Jul 19 10:30:56 2017 +1000

    use utf8; doesn't force unicode semantics on all strings in scope

    eg.

    $ perl -Mutf8 -le 'chr(0xdf) =~ /ss/i and print "match" or print "no match"'
    no match

    perhaps this should be removed, or completely re-worded, it's worded
    similarly to the next point which behaves differently.

M	pod/perlunicode.pod
-----------------------------------------------------------------------

Summary of changes:
 lib/utf8.pm          | 71 +++++++++++++++++++++++++++++++++++++++++-----------
 pod/perlunicode.pod  |  7 +++---
 pod/perluniintro.pod |  7 ++++--
 3 files changed, 66 insertions(+), 19 deletions(-)

diff --git a/lib/utf8.pm b/lib/utf8.pm
index 324cb87c86..34930a0554 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -2,7 +2,7 @@ package utf8;
 
 $utf8::hint_bits = 0x00800000;
 
-our $VERSION = '1.19';
+our $VERSION = '1.20';
 
 sub import {
     $^H |= $utf8::hint_bits;
@@ -109,11 +109,26 @@ you should not say that unless you really want to have UTF-8 source code.
 Converts in-place the internal representation of the string from an octet
 sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
 logical character sequence itself is unchanged. If I<$string> is already
-stored as UTF-8, then this is a no-op. Returns the
-number of octets necessary to represent the string as UTF-8. Can be
-used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
-work as Unicode on strings containing non-ASCII characters whose code points
-are below 256.
+upgraded, then this is a no-op. Returns the
+number of octets necessary to represent the string as UTF-8.
+
+If your code needs to be compatible with versions of perl without
+C<use feature 'unicode_strings';>, you can force Unicode semantics on
+a given string:
+
+    # force unicode semantics for $string without the
+    # "unicode_strings" feature
+    utf8::upgrade($string);
+
+For example:
+
+    # without explicit or implicit use feature 'unicode_strings'
+    my $x = "\xDF";    # LATIN SMALL LETTER SHARP S
+    $x =~ /ss/i;       # won't match
+    my $y = uc($x);    # won't convert
+    utf8::upgrade($x);
+    $x =~ /ss/i;       # matches
+    my $z = uc($x);    # converts to "SS"
 
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
@@ -136,6 +151,15 @@ true, returns false.
 
 Returns true on success.
 
+If your code expects an octet sequence this can be used to validate
+that you've received one:
+
+    # throw an exception if not representable as octets
+    utf8::downgrade($string)
+
+    # or do your own error handling
+    utf8::downgrade($string, 1) or die "string must be octets";
+
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
 
@@ -148,11 +172,16 @@ replaced with a sequence of one or more characters that represent the
 individual UTF-8 bytes of the character. The UTF8 flag is turned off.
 Returns nothing.
 
-    my $a = "\x{100}"; # $a contains one character, with ord 0x100
-    utf8::encode($a);  # $a contains two characters, with ords (on
+    my $x = "\x{100}"; # $x contains one character, with ord 0x100
+    utf8::encode($x);  # $x contains two characters, with ords (on
                        # ASCII platforms) 0xc4 and 0x80. On EBCDIC
                        # 1047, this would instead be 0x8C and 0x41.
 
+Similar to:
+
+    use Encode;
+    $x = Encode::encode("utf8", $x);
+
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
 
@@ -167,11 +196,11 @@ turned on only if the source string contains multiple-byte UTF-8
 characters. If I<$string> is invalid as UTF-8, returns false;
 otherwise returns true.
 
-    my $a = "\xc4\x80"; # $a contains two characters, with ords
+    my $x = "\xc4\x80"; # $x contains two characters, with ords
                         # 0xc4 and 0x80
-    utf8::decode($a);   # On ASCII platforms, $a contains one char,
+    utf8::decode($x);   # On ASCII platforms, $x contains one char,
                         # with ord 0x100. Since these bytes aren't
-                        # legal UTF-EBCDIC, on EBCDIC platforms, $a is
+                        # legal UTF-EBCDIC, on EBCDIC platforms, $x is
                         # unchanged and the function returns FALSE.
 
 B<Note that this function does not handle arbitrary encodings>;
@@ -208,7 +237,22 @@ platforms, so there is no performance hit in using it there.
 =item * C<$flag = utf8::is_utf8($string)>
 
 (Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in
-UTF-8. Functionally the same as C<Encode::is_utf8()>.
+UTF-8. Functionally the same as C<Encode::is_utf8($string)>.
+
+Typically only necessary for debugging and testing, if you need to
+dump the internals of an SV, L<Devel::Peek's|Devel::Peek> Dump()
+provides more detail in a compact form.
+
+If you still think you need this outside of debugging, testing or
+dealing with filenames, you should probably read L<perlunitut> and
+L<perlunifaq/What is "the UTF8 flag"?>.
+
+Don't use this flag as a marker to distinguish character and binary
+data, that should be decided for each variable when you write your
+code.
+
+To force unicode semantics in code portable to perl 5.8 and 5.10, call
+C<utf8::upgrade($string)> unconditionally.
 
 =item * C<$flag = utf8::valid($string)>
 
@@ -216,8 +260,7 @@ UTF-8. Functionally the same as C<Encode::is_utf8()>.
 UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8 flag
 on B<or> if I<$string> is held as bytes (both these states are 'consistent').
 Main reason for this routine is to allow Perl's test suite to check
-that operations have left strings in a consistent state. You most
-probably want to use C<utf8::is_utf8()> instead.
+that operations have left strings in a consistent state.
 
 =back
 
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ef02b0a1f5..24102bf1a9 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -60,10 +60,11 @@ filenames.
 Use the C<:encoding(...)> layer to read from and write to
 filehandles using the specified encoding. (See L<open>.)
 
-=item You should convert your non-ASCII, non-UTF-8 Perl scripts to be
+=item You must convert your non-ASCII, non-UTF-8 Perl scripts to be
 UTF-8.
 
-See L<encoding>.
+The L<encoding> module has been deprecated since perl 5.18 and the
+perl internals it requires have been removed with perl 5.26.
 
 =item C<use utf8> still needed to enable L<UTF-8|/Unicode Encodings> in scripts
 
@@ -233,7 +234,7 @@ Unicode:
 Within the scope of S<C<use utf8>>
 
 If the whole program is Unicode (signified by using 8-bit B<U>nicode
-B<T>ransformation B<F>ormat), then all strings within it must be
+B<T>ransformation B<F>ormat), then all literal strings within it must be
 Unicode.
 
 =item *
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0ad9ddab82..5e263b4e63 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -473,8 +473,11 @@ standardisation organisations are recognised; for a more detailed
 list see L<Encode::Supported>.
 
 C<read()> reads characters and returns the number of characters.
-C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
-and C<sysseek()>.
+C<seek()> and C<tell()> operate on byte counts, as does C<sysseek()>.
+
+C<sysread()> and C<syswrite()> should not be used on file handles with
+character encoding layers, they behave badly, and that behaviour has
+been deprecated since perl 5.24.
 
 Notice that because of the default behaviour of not doing any
 conversion upon input if there is no default layer,

--
Perl5 Master Repository
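
For anyone who wants to try the behaviour documented in the lib/utf8.pm hunks, the following is a minimal sketch rather than part of the commit. It assumes an ASCII platform and a script that does not enable C<use feature 'unicode_strings'> (so no C<use v5.12> or later); the variable names are illustrative only.

```perl
use strict;
use warnings;
# Deliberately no "use v5.12" / "use feature 'unicode_strings'":
# either one would already give native 8-bit strings Unicode semantics.

# utf8::upgrade(): force Unicode semantics on a single string.
my $x = "\xDF";                          # LATIN SMALL LETTER SHARP S, one octet
my $match_before = ($x =~ /ss/i) ? 1 : 0;
my $uc_changed_before = (uc($x) ne $x) ? 1 : 0;
utf8::upgrade($x);                       # same characters, upgraded internal form
my $match_after = ($x =~ /ss/i) ? 1 : 0;
my $uc_after    = uc($x);

# utf8::downgrade() with a true second argument: validate that a value
# is representable as octets without throwing an exception.
my $octets = "caf\xE9";
utf8::upgrade($octets);                  # still only characters below 256
my $ok_octets = utf8::downgrade($octets, 1) ? 1 : 0;
my $wide      = "\x{100}";               # cannot be stored as a single octet
my $ok_wide   = utf8::downgrade($wide, 1) ? 1 : 0;

# utf8::encode() / utf8::decode() round trip (ASCII platforms).
my $s = "\x{100}";
utf8::encode($s);                        # now two characters: 0xC4, 0x80
my $encoded_length = length $s;
my $decoded_ok     = utf8::decode($s) ? 1 : 0;
my $decoded_ord    = ord $s;

print "match before upgrade: $match_before\n";
print "uc() changed string before upgrade: $uc_changed_before\n";
print "match after upgrade: $match_after, uc: $uc_after\n";
print "downgrade ok (octets): $ok_octets, (wide): $ok_wide\n";
print "encoded length: $encoded_length, decoded ok: $decoded_ok, ord: $decoded_ord\n";
```

The sketch never branches on utf8::is_utf8(): it calls utf8::upgrade() unconditionally and uses the return value of utf8::downgrade() for validation, which is the pattern the new documentation recommends.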
