Yes, that's true. The encoding is also self-synchronizing: if there's one bad character somewhere, you only have to move a few bytes forward in the stream to get good data again. Also, many C library functions that were designed for ASCII (strcpy and strcat, for example) continue to work with UTF-8 without change, since UTF-8 never produces a zero byte except for U+0000. It's a nifty encoding.
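Here is a rough sketch in C of what I mean by self-synchronizing. The names utf8_is_continuation and utf8_resync are made up for illustration and do not come from any library, and the sketch only finds the next byte where a character can start; it does not validate the sequence that follows.

```c
#include <assert.h>
#include <stddef.h>

/* A continuation byte has the bit pattern 10xxxxxx. */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Return the index of the first byte at or after `pos` that is not a
 * continuation byte, i.e. the next place a character can start.
 * Returns `len` if there is no such byte. (Hypothetical helper.) */
static size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    assert(buf != NULL);
    while (pos < len && utf8_is_continuation(buf[pos]))
        pos++;
    return pos;
}
```

With the bytes 'a', 0xE2, 0x82, 0xAC, 'b' (the euro sign from this thread between two ASCII letters), starting at index 2, in the middle of the euro sign, utf8_resync skips 0x82 and 0xAC and lands on index 4, the 'b'.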
(Heh, my reply earlier contained an error too; I meant to say middle byte & 0xC0 != 0xC0, not that it == 0.)

On Sun, Jan 18, 2009 at 8:56 PM, Shmuel Fomberg <[email protected]> wrote:
>
> The question should be: what I've been thinking?
> OK. Here is another try.
>
> The first byte dictates how many bytes there are:
> first & 0x80 == 0 => one byte
> first & 0xE0 == 0xC0 => two bytes
> first & 0xF0 == 0xE0 => three bytes
> first & 0xF8 == 0xF0 => four bytes
>
> And for every other byte in the character:
> byte & 0xC0 == 0x80
>
> For your example, the first byte starts with an 'E', meaning three
> bytes. the rest of the bytes starts with '8' and 'A' - OK.
>
> Shmuel.
>
> Gaal Yahas wrote:
>> Are you sure? €, U+20AC is represented in UTF-8 as 0xE2, 0x82, 0xAC.
>> The middle byte & 0xC0 == 0.
>>
>> On Sun, Jan 18, 2009 at 7:26 PM, Shmuel Fomberg <[email protected]> wrote:
>>> Hi.
>>>
>>> I've been reading a bit about utf8, and I learned that when reading a
>>> utf8 character, for each byte I need to check:
>>> (byte & 0xC0 ) == 0xC0
>>> means that there is another byte for this character. Otherwise, it's the
>>> last byte of the character.
>>>
>>> Shmuel.
>
> _______________________________________________
> Perl mailing list
> [email protected]
> http://perl.org.il/mailman/listinfo/perl
>

--
Gaal Yahas <[email protected]>
http://gaal.livejournal.com/
