Round Trip in UTF-8: Two Byte Encodings Explanation:
Latin-1 includes all of 127 7-bit ASCII and begins the 2-byte encodings in UTF-8. Below, Latin-1 character at code point 200 is represented as bytes then broken down into bits. When utf-8 needs two bytes (it might use up to six), the leading byte begins 110 to signify that, and the followup byte begins 10. So the significant payload bits are just 11001000 which Python shows is 200, where we started. http://youtu.be/vLBtrd9Ar28 (see chart @ 10:30) Leading byte begins 0xxxxxxx - this is the only byte 110xxxxx - another byte after this 1110xxxx - two more bytes 11110xxx - three more bytes 111110xx - four more bytes 1111110x - a total of six bytes x means 'room for payload' Console session: Python 3.2.3 (v3.2.3:3d0686d90f55, Apr 10 2012, 11:09:56) [GCC 4.0.1 (Apple Inc. build 5493)] on darwin sys.path.extend(['/Users/kurner/Documents']) >>> import sys >>> sys.getdefaultencoding() 'utf-8' >>> chr(200) 'È' >>> bytes(chr(200), encoding='utf-8') b'\xc3\x88' >>> bin(0xc3) # left byte in bits '0b11000011' >>> bin(0x88) # right byte in bits '0b10001000' >>> 0b11001000 # the encoded number 200
_______________________________________________ Edu-sig mailing list Edu-sig@python.org https://mail.python.org/mailman/listinfo/edu-sig