The real question is: what was I thinking? OK, here is another try.

The first byte dictates how many bytes the character has:
    (first & 0x80) == 0x00  =>  one byte
    (first & 0xE0) == 0xC0  =>  two bytes
    (first & 0xF0) == 0xE0  =>  three bytes
    (first & 0xF8) == 0xF0  =>  four bytes
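
The first-byte rules above can be sketched roughly like this. It is written in Python only for illustration, and the function name utf8_sequence_length is made up for this example. It only classifies the first byte; it does not check the bytes that follow it.

```python
def utf8_sequence_length(first):
    """Return the byte count announced by a UTF-8 first byte.

    Returns None when `first` cannot start a character, for example
    a continuation byte of the form 10xxxxxx.
    """
    if (first & 0x80) == 0x00:   # 0xxxxxxx
        return 1
    if (first & 0xE0) == 0xC0:   # 110xxxxx
        return 2
    if (first & 0xF0) == 0xE0:   # 1110xxxx
        return 3
    if (first & 0xF8) == 0xF0:   # 11110xxx
        return 4
    return None


euro = "\u20ac".encode("utf-8")          # the bytes 0xE2, 0x82, 0xAC
print(utf8_sequence_length(euro[0]))     # 0xE2 -> prints 3
```

The parentheses around each mask are redundant in Python, but in Perl and C they are required, because == binds tighter than & there.
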
And every other byte in the character must satisfy:
    (byte & 0xC0) == 0x80

In your example, the first byte is 0xE2. It starts with an 'E', meaning three bytes. The remaining bytes, 0x82 and 0xAC, start with '8' and 'A', so both pass the check above - OK.

Shmuel.

Gaal Yahas wrote:
> Are you sure? €, U+20AC is represented in UTF-8 as 0xE2, 0x82, 0xAC.
> The middle byte & 0xC0 == 0.
>
> On Sun, Jan 18, 2009 at 7:26 PM, Shmuel Fomberg <[email protected]> wrote:
>> Hi.
>>
>> I've been reading a bit about utf8, and I learned that when reading a
>> utf8 character, for each byte I need to check:
>> (byte & 0xC0 ) == 0xC0
>> means that there is another byte for this character. Otherwise, it's the
>> last byte of the character.
>>
>> Shmuel.
_______________________________________________
Perl mailing list
[email protected]
http://perl.org.il/mailman/listinfo/perl
