On Thu, 12 Jul 2012 15:40:55 -0500 Andy Bach <afb...@gmail.com> wrote:
> On Thu, Jul 12, 2012 at 3:25 PM, Manfred Lotz <manfred.l...@arcor.de> wrote:
> > This is really nice. I fumbled with unpack before but have to admit
> > that I didn't know about 'use bytes' which is the key.
>
> Couple interesting links, unpack in painful detail:
> http://www.perlmonks.org/?node_id=224666
>
> and utf-8 and "use bytes" info:
> http://perldoc.perl.org/perluniintro.html (search for 'use bytes' and
> look around)
>

great, thx

> because
>     use bytes;
>
> affects the whole script, you want to finish w/
>     no bytes;

But if I'm using it in a subroutine then 'use bytes' is only effective
in the scope of the subroutine, isn't it? That is how I would use it
preferably.

> How Do I Know Whether My String Is In Unicode?
>
> You shouldn't have to care. But you may if your Perl is before 5.14.0
> or you haven't specified use feature 'unicode_strings' or use 5.012
> (or higher) because otherwise the semantics of the code points in the
> range 128 to 255 are different depending on whether the string they
> are contained within is in Unicode or not. (See When Unicode Does Not
> Happen in perlunicode.)
>
> To determine if a string is in Unicode, use:
>
>     print utf8::is_utf8($string) ? 1 : 0, "\n";
>
> But note that this doesn't mean that any of the characters in the
> string are necessary UTF-8 encoded, or that any of the characters have
> code points greater than 0xFF (255) or even 0x80 (128), or that the
> string has any characters at all. All the is_utf8() does is to return
> the value of the internal "utf8ness" flag attached to the $string. If
> the flag is off, the bytes in the scalar are interpreted as a single
> byte encoding. If the flag is on, the bytes in the scalar are
> interpreted as the (variable-length, potentially multi-byte) UTF-8
> encoded code points of the characters. Bytes added to a UTF-8 encoded
> string are automatically upgraded to UTF-8. If mixed non-UTF-8 and
> UTF-8 scalars are merged (double-quoted interpolation, explicit
> concatenation, or printf/sprintf parameter substitution), the result
> will be UTF-8 encoded as if copies of the byte strings were upgraded
> to UTF-8: for example,
>
>     $a = "ab\x80c";
>     $b = "\x{100}";
>     print "$a = $b\n";
>
> the output string will be UTF-8-encoded ab\x80c = \x{100}\n, but $a
> will stay byte-encoded.
>
> Sometimes you might really need to know the byte length of a string
> instead of the character length. For that use either the
> Encode::encode_utf8() function or the bytes pragma and the length()
> function:
>
>     my $unicode = chr(0x100);
>     print length($unicode), "\n"; # will print 1
>     require Encode;
>     print length(Encode::encode_utf8($unicode)),"\n"; # will print 2
>     use bytes;
>     print length($unicode), "\n"; # will also print 2
>                                   # (the 0xC4 0x80 of the UTF-8)
>     no bytes;

Thanks a lot for your detailed explanations. Will study it thoroughly.

-- 
Manfred
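
P.S. On the scoping question above: as far as I understand it, 'use bytes' is a lexically scoped pragma, so inside a subroutine it should only apply up to the closing brace of that block. Here is a minimal sketch to check this. The helper name byte_length is made up for illustration, and the string is the same chr(0x100) as in the quoted perluniintro example.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): returns the number
# of bytes in the string's internal representation.
sub byte_length {
    my ($string) = @_;
    use bytes;                  # lexically scoped: ends with this block
    return length($string);     # byte semantics in here
}

my $unicode = chr(0x100);       # a single character, U+0100

my $chars = length($unicode);       # outside the sub: character semantics
my $bytes = byte_length($unicode);  # inside the sub: byte semantics

print "characters: $chars\n";
print "bytes:      $bytes\n";
```

If the pragma leaked out of the subroutine, both numbers would be the same. With lexical scoping the first is the character count and the second is the byte count, so no 'no bytes;' should be needed after the sub. Note that the bytes documentation itself discourages the pragma for anything but debugging, so length(Encode::encode_utf8($string)) from the quoted text may be the safer way to get a byte count.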