On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:
>
> So, the UTF-16 UTF-32 is INTERNAL only, for Python

NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8, are
encodings for the EXTERNAL representation of Unicode characters in byte
streams.

> I also was not aware that UTF-8 chars could be up to six(6) bytes long
> from left to right.

It could be, once upon a time in ISO faerieland, when it was thought that
Unicode could grow to 2**31 codepoints. However, ISO and the Unicode
Consortium have agreed that 17 planes is the utter max, and accordingly a
valid UTF-8 byte sequence can be no longer than 4 bytes ... see below:

>>> chr(17 * 65536)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)
>>> chr(17 * 65536 - 1)
'\U0010ffff'
>>> _.encode('utf8')
b'\xf4\x8f\xbf\xbf'
>>> b'\xf5\x8f\xbf\xbf'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python32\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf5 in position 0: invalid start byte
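
To make the 4-byte ceiling concrete, here is a minimal illustrative sketch of how many bytes UTF-8 needs for a given code point, cross-checked against Python's own encoder. The function name utf8_length and the variable names are hypothetical, chosen only for this example. The last few lines encode the same character with all three codecs to show that each one simply produces a byte string; the 'utf-16-le' and 'utf-32-le' codec names are used so that no byte order mark is prepended.

```python
def utf8_length(codepoint):
    """Return the number of bytes UTF-8 uses for one code point.

    Hypothetical helper for illustration; it follows the 17-plane
    limit (code points 0 through 0x10FFFF).
    """
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("code point not in range(0x110000)")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    if codepoint < 0x80:
        return 1
    if codepoint < 0x800:
        return 2
    if codepoint < 0x10000:
        return 3
    return 4


# Cross-check the helper against the built-in encoder at each boundary.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert len(chr(cp).encode('utf-8')) == utf8_length(cp)

# The same character in three external (byte stream) encodings.
last_char = chr(0x10FFFF)
encoded = {name: last_char.encode(name)
           for name in ('utf-8', 'utf-16-le', 'utf-32-le')}
for name, data in sorted(encoded.items()):
    print(name, len(data), data)
```

The highest code point takes 4 bytes in each of the three encodings here; the difference between them shows up at the low end of the range, where UTF-8 needs only 1 byte and UTF-32 still needs 4.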