On 7/27/05, Moshe Kaminsky <[EMAIL PROTECTED]> wrote:
> Hi,
> * Fernando Canizo <[EMAIL PROTECTED]> [27/07/05 14:14]:
<snip>
> > I investigate what was in the archives, so i saved a copy (using 'C'
> > command from mutt) of the first message (the one i receive from me)
> > and file says: 'UTF-8 Unicode mail text', check what's inside with
> > hexedit and see that LATIN SMALL LETTER A WITH ACUTE is encoded with
> > this hex: C3 A1 (which is not 00 E1 from unicode chart from
> > http://www.unicode.org/charts/)
>
> I think this is just the way these characters are represented in utf-8.

Yes, it is. 00E1 hex is '00000000 11100001' in binary. When this value is
encoded as UTF-8, it is stored in two bytes. The last byte begins with
'10', followed by the last 6 bits of data: '10 100001' binary, or 'A1' in
hex. The first byte begins with '110', which indicates a two-byte
character, followed by the remaining significant bits: '110 00011' binary,
or 'C3' hex. So C3 A1 is correct.

The problem seems to be that mutt(?) takes this UTF-8 encoded data and
encodes it as UTF-8 again, as if the data were two 8-bit characters. 'C3'
then becomes 'C3 83' and 'A1' becomes 'C2 A1'.

/Andreas

-- 
gentoo-user@gentoo.org mailing list
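
For anyone who wants to check the arithmetic above, here is a minimal Python
sketch. It builds the two UTF-8 bytes for U+00E1 by hand, compares them with
Python's own encoder, and then reproduces the double encoding. It assumes
the second pass treats each byte as a Latin-1 (ISO-8859-1) character, which
is only a guess at what is happening here, not a confirmed diagnosis of
mutt. The variable names are made up for illustration.

```python
# Sketch only: reproduces the byte values discussed in this mail.
ch = "\u00e1"  # LATIN SMALL LETTER A WITH ACUTE

code_point = ord(ch)  # 0x00E1

# Two-byte UTF-8 layout: 110xxxxx 10xxxxxx
lead = 0b11000000 | (code_point >> 6)          # '110' + top 5 bits
trail = 0b10000000 | (code_point & 0b00111111)  # '10' + low 6 bits
manual = bytes([lead, trail])

# Python's own encoder should agree with the hand-built bytes.
once = ch.encode("utf-8")

# Assumed failure mode: each UTF-8 byte is misread as one Latin-1
# character and the result is encoded as UTF-8 a second time.
twice = once.decode("latin-1").encode("utf-8")

print("manual:", manual.hex())
print("once:  ", once.hex())
print("twice: ", twice.hex())
```

If the saved message shows C3 83 C2 A1 where C3 A1 is expected, that would
be consistent with this kind of double encoding.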