[perl.git] branch blead, updated. v5.27.2-15-g0397beb0d1

Tony Cook Sun, 23 Jul 2017 18:41:50 -0700

In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/0397beb0d12565d70e168bfea7376e2612a6748a?hp=8dad89f0f488f99f930bb2955b564838f54816fa>


- Log -----------------------------------------------------------------
commit 0397beb0d12565d70e168bfea7376e2612a6748a
Author: Tony Cook <[email protected]>
Date:   Mon Jul 24 11:05:40 2017 +1000

    (perl #131685) improve utf8::* function documentation
    
    Splits the little cheat sheet I posted as a comment into pieces
    and puts them closer to where they belong
    
    - better document why you'd want to use utf8::upgrade()
    
    - similarly for utf8::downgrade()
    
    - try hard to convince people not to use utf8::is_utf8()
    
    - no, utf8::is_utf8() isn't what you want instead of utf8::valid()
    
    - change some examples to use $x instead of the sort reserved $a

M       lib/utf8.pm

commit ee329aefb9c0bfcee0e6cc41dcd6eb8b03206f30
Author: Tony Cook <[email protected]>
Date:   Wed Jul 19 15:42:18 2017 +1000

    unfortunately sysread() tries to read characters

M       pod/perluniintro.pod

commit 01c3fbbc0d1b54bb0dd6fdc0abed7854e62c6717
Author: Tony Cook <[email protected]>
Date:   Wed Jul 19 10:45:33 2017 +1000

    encoding.pm no longer works

M       pod/perlunicode.pod

commit e423fa83496ce7d83b137bd7f0852864b6073b36
Author: Tony Cook <[email protected]>
Date:   Wed Jul 19 10:30:56 2017 +1000

    use utf8; doesn't force unicode semantics on all strings in scope
    
    eg.
    
    $ perl -Mutf8 -le 'chr(0xdf) =~ /ss/i and print "match" or print "no match"'
    no match
    
    perhaps this should be removed, or completely re-worded, it's worded
    similarly to the next point which behaves differently.

M       pod/perlunicode.pod
-----------------------------------------------------------------------

Summary of changes:
 lib/utf8.pm          | 71 +++++++++++++++++++++++++++++++++++++++++-----------
 pod/perlunicode.pod  |  7 +++---
 pod/perluniintro.pod |  7 ++++--
 3 files changed, 66 insertions(+), 19 deletions(-)

diff --git a/lib/utf8.pm b/lib/utf8.pm
index 324cb87c86..34930a0554 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -2,7 +2,7 @@ package utf8;
 
 $utf8::hint_bits = 0x00800000;
 
-our $VERSION = '1.19';
+our $VERSION = '1.20';
 
 sub import {
     $^H |= $utf8::hint_bits;
@@ -109,11 +109,26 @@ you should not say that unless you really want to have UTF-8 source code.
 Converts in-place the internal representation of the string from an octet
 sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
 logical character sequence itself is unchanged.  If I<$string> is already
-stored as UTF-8, then this is a no-op. Returns the
-number of octets necessary to represent the string as UTF-8.  Can be
-used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
-work as Unicode on strings containing non-ASCII characters whose code points
-are below 256.
+upgraded, then this is a no-op. Returns the
+number of octets necessary to represent the string as UTF-8.
+
+If your code needs to be compatible with versions of perl without
+C<use feature 'unicode_strings';>, you can force Unicode semantics on
+a given string:
+
+  # force unicode semantics for $string without the
+  # "unicode_strings" feature
+  utf8::upgrade($string);
+
+For example:
+
+  # without explicit or implicit use feature 'unicode_strings'
+  my $x = "\xDF";    # LATIN SMALL LETTER SHARP S
+  $x =~ /ss/i;       # won't match
+  my $y = uc($x);    # won't convert
+  utf8::upgrade($x);
+  $x =~ /ss/i;       # matches
+  my $z = uc($x);    # converts to "SS"
 
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
@@ -136,6 +151,15 @@ true, returns false.
 
 Returns true on success.
 
+If your code expects an octet sequence this can be used to validate
+that you've received one:
+
+  # throw an exception if not representable as octets
+  utf8::downgrade($string)
+
+  # or do your own error handling
+  utf8::downgrade($string, 1) or die "string must be octets";
+
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
 
@@ -148,11 +172,16 @@ replaced with a sequence of one or more characters that represent the
 individual UTF-8 bytes of the character.  The UTF8 flag is turned off.
 Returns nothing.
 
- my $a = "\x{100}"; # $a contains one character, with ord 0x100
- utf8::encode($a);  # $a contains two characters, with ords (on
+ my $x = "\x{100}"; # $x contains one character, with ord 0x100
+ utf8::encode($x);  # $x contains two characters, with ords (on
                     # ASCII platforms) 0xc4 and 0x80.  On EBCDIC
                     # 1047, this would instead be 0x8C and 0x41.
 
+Similar to:
+
+  use Encode;
+  $x = Encode::encode("utf8", $x);
+
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
 
@@ -167,11 +196,11 @@ turned on only if the source string contains multiple-byte UTF-8
 characters.  If I<$string> is invalid as UTF-8, returns false;
 otherwise returns true.
 
- my $a = "\xc4\x80"; # $a contains two characters, with ords
+ my $x = "\xc4\x80"; # $x contains two characters, with ords
                      # 0xc4 and 0x80
- utf8::decode($a);   # On ASCII platforms, $a contains one char,
+ utf8::decode($x);   # On ASCII platforms, $x contains one char,
                      # with ord 0x100.   Since these bytes aren't
-                     # legal UTF-EBCDIC, on EBCDIC platforms, $a is
+                     # legal UTF-EBCDIC, on EBCDIC platforms, $x is
                      # unchanged and the function returns FALSE.
 
 B<Note that this function does not handle arbitrary encodings>;
@@ -208,7 +237,22 @@ platforms, so there is no performance hit in using it there.
 =item * C<$flag = utf8::is_utf8($string)>
 
 (Since Perl 5.8.1)  Test whether I<$string> is marked internally as encoded in
-UTF-8.  Functionally the same as C<Encode::is_utf8()>.
+UTF-8.  Functionally the same as C<Encode::is_utf8($string)>.
+
+Typically only necessary for debugging and testing, if you need to
+dump the internals of an SV, L<Devel::Peek's|Devel::Peek> Dump()
+provides more detail in a compact form.
+
+If you still think you need this outside of debugging, testing or
+dealing with filenames, you should probably read L<perlunitut> and
+L<perlunifaq/What is "the UTF8 flag"?>.
+
+Don't use this flag as a marker to distinguish character and binary
+data, that should be decided for each variable when you write your
+code.
+
+To force unicode semantics in code portable to perl 5.8 and 5.10, call
+C<utf8::upgrade($string)> unconditionally.
 
 =item * C<$flag = utf8::valid($string)>
 
@@ -216,8 +260,7 @@ UTF-8.  Functionally the same as C<Encode::is_utf8()>.
 UTF-8.  Will return true if it is well-formed UTF-8 and has the UTF-8 flag
 on B<or> if I<$string> is held as bytes (both these states are 'consistent').
 Main reason for this routine is to allow Perl's test suite to check
-that operations have left strings in a consistent state.  You most
-probably want to use C<utf8::is_utf8()> instead.
+that operations have left strings in a consistent state.
 
 =back
 
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ef02b0a1f5..24102bf1a9 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -60,10 +60,11 @@ filenames.
 Use the C<:encoding(...)> layer  to read from and write to
 filehandles using the specified encoding.  (See L<open>.)
 
-=item You should convert your non-ASCII, non-UTF-8 Perl scripts to be
+=item You must convert your non-ASCII, non-UTF-8 Perl scripts to be
 UTF-8.
 
-See L<encoding>.
+The L<encoding> module has been deprecated since perl 5.18 and the
+perl internals it requires have been removed with perl 5.26.
 
 =item C<use utf8> still needed to enable L<UTF-8|/Unicode Encodings> in scripts
 
@@ -233,7 +234,7 @@ Unicode:
 Within the scope of S<C<use utf8>>
 
 If the whole program is Unicode (signified by using 8-bit B<U>nicode
-B<T>ransformation B<F>ormat), then all strings within it must be
+B<T>ransformation B<F>ormat), then all literal strings within it must be
 Unicode.
 
 =item *
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0ad9ddab82..5e263b4e63 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -473,8 +473,11 @@ standardisation organisations are recognised; for a more detailed
 list see L<Encode::Supported>.
 
 C<read()> reads characters and returns the number of characters.
-C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
-and C<sysseek()>.
+C<seek()> and C<tell()> operate on byte counts, as does C<sysseek()>.
+
+C<sysread()> and C<syswrite()> should not be used on file handles with
+character encoding layers, they behave badly, and that behaviour has
+been deprecated since perl 5.24.
 
 Notice that because of the default behaviour of not doing any
 conversion upon input if there is no default layer,
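
The lib/utf8.pm hunks above describe three behaviours that may be easier to follow when run together. The following is a minimal sketch and not part of the commit: it assumes an ASCII (non-EBCDIC) platform and a file that does not enable the "unicode_strings" feature. All variable names are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Assumption: ASCII platform, and 'unicode_strings' is NOT in effect here
# (no "use feature 'unicode_strings'" and no "use v5.12" or later).

# 1. utf8::upgrade() forces Unicode semantics on one string.
my $x = "\xDF";                              # LATIN SMALL LETTER SHARP S
my $matched_before = $x =~ /ss/i ? 1 : 0;    # native semantics apply
utf8::upgrade($x);
my $matched_after  = $x =~ /ss/i ? 1 : 0;    # Unicode semantics apply
my $upper          = uc $x;

# 2. utf8::downgrade() as a check that a string is representable as octets.
my $wide       = "\x{100}";                  # cannot be held as octets
my $downgraded = utf8::downgrade($wide, 1) ? 1 : 0;            # FAIL_OK set
my $died       = eval { utf8::downgrade($wide); 1 } ? 0 : 1;   # no FAIL_OK

# 3. utf8::encode() works in place; Encode::encode() returns a new string.
my $in_place = "\x{100}";
utf8::encode($in_place);
my $source = "\x{100}";
my $copied = Encode::encode("utf8", $source);

# Render a byte string as space-separated hex for display.
my $hex = join " ", map { sprintf "%02X", ord } split //, $in_place;

print "match before upgrade: $matched_before, after: $matched_after\n";
print "uc after upgrade: $upper\n";
print "downgrade with FAIL_OK succeeded: $downgraded, without it died: $died\n";
print "utf8::encode bytes: $hex, same as Encode::encode: ",
      ($in_place eq $copied ? "yes" : "no"), "\n";
```

The utf8:: functions are called here without "use utf8", since they are available without loading the pragma. The eval around the second utf8::downgrade() call is one way to do the "own error handling" the new documentation mentions when the exception form is used.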

--
Perl5 Master Repository
