(Beating a dead horse.) Is it too ridiculous to suggest that it would be nice if the unicode object remembered the encoding of the string it was decoded from? That would make it feasible to calculate the number of bytes that make up its code points in that encoding.
    # U+270C
    # 11100010 10011100 10001100
    buf = "\xE2\x9C\x8C"

    u = buf.decode('UTF-8')

    # ... later ...

    u.bytes() -> 3

(goes through each code point and calculates the number of bytes that make up the character according to the encoding)
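For what it's worth, something close to this can be approximated today with a small wrapper. The following is only a rough sketch, written in Python 3 terms (where bytes.decode takes the place of the Python 2 str.decode above). The names EncodedStr and byte_length are made up for illustration and are not part of any existing API. It also encodes the whole string in one call rather than walking it code point by code point, which gives the same total for UTF-8 but avoids counting a byte order mark once per character with codecs such as UTF-16.

```python
class EncodedStr(str):
    """Hypothetical str subclass that remembers the encoding it was decoded from."""

    def __new__(cls, raw, encoding):
        # Decode the raw bytes and build the string as usual...
        self = super().__new__(cls, raw.decode(encoding))
        # ...then keep the encoding name around for later.
        self.encoding = encoding
        return self

    def byte_length(self):
        # Re-encode with the remembered encoding and count the resulting bytes.
        return len(self.encode(self.encoding))


buf = b"\xE2\x9C\x8C"           # U+270C encoded as UTF-8
u = EncodedStr(buf, "UTF-8")

# ... later ...

print(u.byte_length())          # 3
```

One caveat with this approach: slicing, concatenation and most other string methods return a plain str, so the remembered encoding is lost as soon as the value is transformed. A built-in version would have to decide what those operations should do.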