patrick hall wrote:

Unicode can get pretty hairy, but it's my
impression that the number of bytes per character
varies depending on your encoding. UTF-8, the
de facto standard nowadays, has variable length
encoding -- characters can take between 3 and 6
bytes, if I recall correctly.
[...]
Sure enough, Perl counts characters, not bytes,
with Unicode text.

In my post of 2003-08-26, "Re: Length() is bits/bytes or neither" (Message-ID: <[EMAIL PROTECTED]>), I included some code illustrating this issue, but without any explanation. It defines a function bytes() which counts the bytes in a UTF-8 string:


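   #!/usr/bin/perl -wl
   use utf8;                     # redundant here, see below
   # Return the length of the argument in bytes rather than characters.
   sub bytes ($) {
       use bytes;                # byte semantics within this block only
       length $_[0];
   }
   $x = chr 123456;              # a single character, stored as UTF-8
   print length $x, " chars";    # character count
   print bytes $x, " bytes";     # byte count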

The "use utf8;" is actually redundant, since perl stores chr(123456) as UTF-8 anyway. This is only one character, so length($x) returns 1. Inside the bytes() subroutine, however, there is a "use bytes;" which forces byte semantics in its lexical scope (i.e. the body of the subroutine), so length() there returns the number of bytes instead of the number of characters, and that is the value bytes() returns. So the bytes() sub calls length() just like the main program does, but thanks to the "bytes" pragma it means something different there: it counts bytes.
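
For comparison, here is a sketch of another way to get the byte count without the "bytes" pragma: encode the string to UTF-8 explicitly and take the length of the result. It assumes the Encode module (bundled with perl since 5.8) is available, and utf8_bytes() is a made-up name used only for this illustration, not something from the earlier post.

```perl
#!/usr/bin/perl -wl
use strict;
use Encode qw(encode);

# Hypothetical helper: number of bytes the string occupies
# once it has been encoded as UTF-8.
sub utf8_bytes {
    my ($string) = @_;
    return length encode('UTF-8', $string);
}

my $x = chr 123456;                 # U+1E240, a single character
print length($x), " chars";         # character count
print utf8_bytes($x), " bytes";     # byte count of the UTF-8 encoding
```

Since chr(123456) is U+1E240, which lies above U+FFFF, it takes four bytes in UTF-8, so this should print "1 chars" and "4 bytes". In standard UTF-8 a character occupies between one and four bytes.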

--
ZSDC Perl and Systems Security Consulting

