willie wrote:

> (beating a dead horse)
>
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
Where it's been is irrelevant. Where it's going to is what matters.

> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.
>
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
>
> u = buf.decode('UTF-8')
>
> # ... later ...
>
> u.bytes() -> 3
>
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)

Suppose the unicode object was decoded using some encoding other than
the one that's going to be used to store the info in the database:

| >>> sg = '\xc9\xb5\xb9\xcf'
| >>> len(sg)
| 4
| >>> u = sg.decode('gb2312')

later:

u.bytes() => 4

but

| >>> len(u.encode('utf8'))
| 6

And by the way, what about the memory overhead of storing the name of
the encoding? In the above case 'gb2312' costs at least 7 bytes: 6
characters plus per-object overhead.

What would u"abcdef".bytes() produce? An exception?

HTH,
John
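P.S. For what it's worth, here is a quick sketch (reusing the same
GB2312 bytes as above, in the Python 2 idiom this thread is using) of
why no remembered encoding is needed: the byte count is a property of
whatever codec you encode *to*, so you just compute it on demand:

| >>> sg = '\xc9\xb5\xb9\xcf'
| >>> u = sg.decode('gb2312')
| >>> len(u.encode('gb2312'))           # size if stored as GB2312
| 4
| >>> len(u.encode('utf8'))             # size if stored as UTF-8
| 6
| >>> len(u"abcdef".encode('utf8'))     # literals work the same way
| 6

One len(encode()) per target codec, no per-object state to carry
around, and u"abcdef" raises no special-case question.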