On Nov 24, 2007 12:06 AM, Ken Perl <[EMAIL PROTECTED]> wrote:
> I use following piece of code to write smiley Unicode  string into a file,
>
> use Encode;
> my $smiley = "\x{263a}";
> open my $out, ">:utf8", "file" or die "$!";
> print $out $smiley;
>
> however, if we dump the output file in binary mode, the hex looks
> wrong, it isn't 263a, any idea what's wrong with the code?
>
> 0000000: e298 ba                                  ...
>
>
> --
> perl -e 'print unpack(u,"62V5N\"FME;G\!E<FQ`9VUA:6PN8V]M\"[EMAIL PROTECTED]
> ")'
>
> --
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> http://learn.perl.org/
>
>
>

There is a difference between UTF-8 and Unicode characters.  UTF-8 is
a method of encoding Unicode characters, so the bytes of the UTF-8
encoded version of a character will not necessarily be the same as the
bits of its Unicode code point.
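You can see the distinction from Perl itself.  A small sketch using
the core Encode module (the variable names are mine):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $smiley = "\x{263a}";               # one Unicode character, U+263A
my $bytes  = encode('UTF-8', $smiley); # its UTF-8 encoding: three bytes

printf "characters: %d\n", length $smiley;      # 1
printf "bytes:      %d\n", length $bytes;       # 3
printf "hex:        %s\n", unpack 'H*', $bytes; # e298ba
```

One character, three bytes: exactly the e2 98 ba you saw in the dump.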

from http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
UTF-8 has the following properties:
    * UCS characters U+0000 to U+007F (ASCII) are encoded simply as
bytes 0x00 to 0x7F (ASCII compatibility). This means that files and
strings which contain only 7-bit ASCII characters have the same
encoding under both ASCII and UTF-8.
    * All UCS characters >U+007F are encoded as a sequence of several
bytes, each of which has the most significant bit set. Therefore, no
ASCII byte (0x00-0x7F) can appear as part of any other character.
    * The first byte of a multibyte sequence that represents a
non-ASCII character is always in the range 0xC0 to 0xFD and it
indicates how many bytes follow for this character. All further bytes
in a multibyte sequence are in the range 0x80 to 0xBF. This allows
easy resynchronization and makes the encoding stateless and robust
against missing bytes.
    * All possible 2^31 UCS codes can be encoded.
    * UTF-8 encoded characters may theoretically be up to six bytes
long, however 16-bit BMP characters are only up to three bytes long.
    * The sorting order of Bigendian UCS-4 byte strings is preserved.
    * The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.

The following byte sequences are used to represent a character. The
sequence to be used depends on the Unicode number of the character:

U-00000000 – U-0000007F:        0xxxxxxx
U-00000080 – U-000007FF:        110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF:        1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF:        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF:        111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF:        1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
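For U+263A (binary 0010 0110 0011 1010) the three-byte row applies.
Filling in the x positions by hand can be sketched like this:

```perl
use strict;
use warnings;

my $cp = 0x263A;  # the code point; it falls in the U-0800..U-FFFF row

# 1110xxxx 10xxxxxx 10xxxxxx: split the 16 bits into 4 + 6 + 6
my @utf8 = (
    0xE0 |  ($cp >> 12),          # top 4 bits, after the 1110 marker
    0x80 | (($cp >>  6) & 0x3F),  # middle 6 bits, after a 10 marker
    0x80 |  ($cp        & 0x3F),  # low 6 bits, after a 10 marker
);

printf "%02x %02x %02x\n", @utf8;  # e2 98 ba
```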

You were trying to write U+263A, and the bytes in the file were e2, 98,
and ba.  Let's see whether that is correct.

The code point U+263A is higher than U+007F, so the first rule does not apply.

The next rule states that every byte that does not represent a
character between 0 and 127 will have the highest bit set. e2 is
11100010, 98 is 10011000, and ba is 10111010.  So they all have their
highest bit set.

The third rule states that e2 should be between c0 and fd (it is) and
that all of the other bytes should be between 80 and bf (ba cuts it
close, but yes, they all are).

Okay, so let's pull the Unicode character back out of the UTF-8 byte
stream.  The first byte starts with 1110, so this is a three-byte
sequence and the character falls between U-0800 and U-FFFF.  The
remaining four bits of e2, 0010, are the character's top four bits.
Next we strip the leading 10 off each of the following two bytes,
leaving 011000 and 111010.  Put them all together and we get
00100110 00111010, or U+263A (the smiley).
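That hand decoding can also be sketched in Perl, masking off the 1110
and 10 marker bits and shifting the payload bits back together:

```perl
use strict;
use warnings;

my @utf8 = (0xE2, 0x98, 0xBA);  # the bytes dumped from the file

# the first byte starts with 1110, so keep its low 4 bits and the
# low 6 bits of each continuation byte, then combine 4 + 6 + 6 bits
my $cp = (($utf8[0] & 0x0F) << 12)
       | (($utf8[1] & 0x3F) <<  6)
       |  ($utf8[2] & 0x3F);

printf "U+%04X\n", $cp;  # U+263A
```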
