For the record:

> I also don't see how this could save a lot of memory. As an example
> take a French text with say 10mio code points. This would end up
> appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
> one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
> on how many accents are used).
Typical French text seems to have about 5% non-ASCII characters. Since an accented Latin-1 character takes two bytes in UTF-8, the number of UTF-8 bytes needed to represent a French text would be only about 5% higher than the number of code points: roughly 10.5MB for the example above, not 15MB.

Anyway, it's quite obvious that Martin's goal is that only one representation gets created most of the time. To quote the draft:

“All three representations are optional, although the str form is considered the canonical representation which can be absent only while the string is being created.”

Regards

Antoine.
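P.S. A rough sketch of the arithmetic, in case anyone wants to check it. The function name `encoded_sizes` is made up for illustration, the 5% figure is only the estimate given above, and the sketch assumes every non-ASCII character takes exactly two bytes in UTF-8 (true for the accented Latin-1 letters, not for something like the euro sign).

```python
def encoded_sizes(code_points: int, non_ascii_fraction: float) -> dict:
    """Estimate the byte size of a text in three encodings.

    Assumption: every non-ASCII character encodes to 2 bytes in UTF-8
    and fits in Latin-1, which holds for typical French accents.
    """
    non_ascii = round(code_points * non_ascii_fraction)
    ascii_count = code_points - non_ascii
    return {
        "ucs2": 2 * code_points,                # 2 bytes per code point
        "latin1": code_points,                  # 1 byte per code point
        "utf8": ascii_count + 2 * non_ascii,    # 1 byte ASCII, 2 bytes otherwise
    }


# The example from the quoted message: 10 million code points, 5% non-ASCII.
sizes = encoded_sizes(10_000_000, 0.05)
for name, size in sizes.items():
    print(f"{name}: {size / 1_000_000:.1f} MB")
```

Under those assumptions the UTF-8 copy comes out 5% larger than the Latin-1 copy, well short of the 15MB guess.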