* Dan Muey <d...@cpanel.net> [2010-10-28 21:55]:
> For example, note the differences in output between a unicode
> string and a byte string regarding character 257, as a unicode
> string it is 257, as a byte string it is 196.

That is not what’s going on.

    $ perl -E'say ord "1234"'
    49

When you pass a multi-character string to `ord`, you get the code
point of the first character.

    $ perl -E'say chr 49'
    1

In your case you get 196. That is 0xC4, or the character Ä. It is
not the character ā (U+0101 = code point 257).

0xC4 is the value of the first byte in the two-byte UTF-8 sequence
(0xC4 0x81) that encodes the character 257. You are passing `ord` a
string that holds those two bytes as two separate characters, and
`ord` is giving you the code point of the first byte-as-character.

You are missing the rest of the bytes from the UTF-8 encoding. You
are losing data.

If you try this on more code points you will find that there are
*lots* of different characters that are reported as 196, because
they get encoded as multi-byte sequences that all start with the
byte value 0xC4.

-- 
*AUTOLOAD=*_;sub _{s/::([^:]*)$/print$1,(",$\/"," ")[defined wantarray]/e;chop;$_}
&Just->another->Perl->hack;
#Aristotle Pagaltzis // <http://plasmasturm.org/>
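
To try the last point on more code points, here is a minimal sketch. It assumes the core Encode module is available; the helper name `ord_both` is made up for this illustration and does not come from the original code. For each code point it builds the one-character string, encodes it to UTF-8 bytes, and calls `ord` on both.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, for illustration only.
# For a code point, return three values:
#   - ord of the one-character string
#   - ord of its UTF-8 byte string (i.e. the value of the first byte)
#   - the number of bytes in the UTF-8 encoding
sub ord_both {
    my ($code_point) = @_;
    my $char  = chr $code_point;           # a single character
    my $bytes = encode('UTF-8', $char);    # one "character" per encoded byte
    return (ord $char, ord $bytes, length $bytes);
}

# A few code points whose UTF-8 encodings share the same lead byte.
for my $code_point (256, 257, 258, 319) {
    my ($as_char, $first_byte, $n_bytes) = ord_both($code_point);
    printf "U+%04X: ord(character) = %d, ord(bytes) = %d (0x%02X), %d bytes\n",
        $code_point, $as_char, $first_byte, $first_byte, $n_bytes;
}
```

Every code point in that list reports a different value for the character string but the same value for the byte string, since `ord` only ever looks at the first element of what it is given.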