Re: Unicode question

Manfred Lotz Fri, 13 Jul 2012 11:27:17 -0700

On Thu, 12 Jul 2012 15:40:55 -0500
Andy Bach <afb...@gmail.com> wrote:


> On Thu, Jul 12, 2012 at 3:25 PM, Manfred Lotz <manfred.l...@arcor.de>
> wrote:
> > This is really nice. I fumbled with unpack before but have to admit
> > that I didn't know about 'use bytes' which is the key.
> 
> Couple interesting links, unpack in painful detail:
> http://www.perlmonks.org/?node_id=224666
> 
> and utf-8 and "use bytes" info:
> http://perldoc.perl.org/perluniintro.html  (search for 'use bytes' and
> look around)
> 

Great, thanks.


> because
> use bytes;
> 
> affects the whole script, you want to finish w/
> no bytes;

But if I use it inside a subroutine, then 'use bytes' is only in effect
within the lexical scope of that subroutine, isn't it? That is how I
would prefer to use it.
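
A minimal sketch of what I have in mind, assuming 'use bytes' is lexically scoped like other pragmas. The helper name byte_length is hypothetical and only there for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number of bytes in the string's
# internal representation rather than the number of characters.
sub byte_length {
    my ($str) = @_;
    use bytes;              # in effect only until the closing brace of this sub
    return length($str);
}

my $unicode = chr(0x100);
print byte_length($unicode), "\n";  # prints 2 (the 0xC4 0x80 of the UTF-8)
print length($unicode), "\n";       # prints 1 ('use bytes' is not in effect here)
```

If that holds, no trailing 'no bytes' is needed, because the pragma ends with the enclosing block.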



> How Do I Know Whether My String Is In Unicode?
> 
> You shouldn't have to care. But you may if your Perl is before 5.14.0
> or you haven't specified use feature 'unicode_strings' or use 5.012
> (or higher) because otherwise the semantics of the code points in the
> range 128 to 255 are different depending on whether the string they
> are contained within is in Unicode or not. (See When Unicode Does Not
> Happen in perlunicode.)
> 
> To determine if a string is in Unicode, use:
> 
>     print utf8::is_utf8($string) ? 1 : 0, "\n";
> 
> But note that this doesn't mean that any of the characters in the
> string are necessarily UTF-8 encoded, or that any of the characters have
> code points greater than 0xFF (255) or even 0x80 (128), or that the
> string has any characters at all. All that is_utf8() does is return
> the value of the internal "utf8ness" flag attached to $string. If
> the flag is off, the bytes in the scalar are interpreted as a single
> byte encoding. If the flag is on, the bytes in the scalar are
> interpreted as the (variable-length, potentially multi-byte) UTF-8
> encoded code points of the characters. Bytes added to a UTF-8 encoded
> string are automatically upgraded to UTF-8. If mixed non-UTF-8 and
> UTF-8 scalars are merged (double-quoted interpolation, explicit
> concatenation, or printf/sprintf parameter substitution), the result
> will be UTF-8 encoded as if copies of the byte strings were upgraded
> to UTF-8: for example,
> 
>     $a = "ab\x80c";
>     $b = "\x{100}";
>     print "$a = $b\n";
> 
> the output string will be UTF-8-encoded ab\x80c = \x{100}\n, but $a
> will stay byte-encoded.
> 
> Sometimes you might really need to know the byte length of a string
> instead of the character length. For that use either the
> Encode::encode_utf8() function or the bytes pragma and the length()
> function:
> 
>     my $unicode = chr(0x100);
>     print length($unicode), "\n"; # will print 1
>     require Encode;
>     print length(Encode::encode_utf8($unicode)),"\n"; # will print 2
>     use bytes;
>     print length($unicode), "\n"; # will also print 2
>     # (the 0xC4 0x80 of the UTF-8)
>     no bytes;
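
To check my understanding of the quoted passage, here is a small sketch that inspects the flag before and after mixing a byte string with a wide-character string. The variable names are my own, and it only prints the flags and the length, so no wide character is written to STDOUT:

```perl
use strict;
use warnings;

my $bytes  = "ab\x80c";     # all code points below 0x100, flag expected off
my $wide   = "\x{100}";     # code point above 0xFF, flag expected on
my $joined = "$bytes = $wide";

print utf8::is_utf8($bytes)  ? 1 : 0, "\n";  # flag of the byte string before and after the merge
print utf8::is_utf8($wide)   ? 1 : 0, "\n";
print utf8::is_utf8($joined) ? 1 : 0, "\n";  # flag of the merged string
print length($joined), "\n";                 # length in characters, not bytes
```

As I read the documentation, the merged string carries the flag while the original byte string stays as it was.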

Thanks a lot for your detailed explanations. I will study them thoroughly.


-- 
Manfred



