On Wed, Jan 26, 2011 at 11:50 AM, Dj Gilcrease <digitalx...@gmail.com> wrote:
> On Tue, Jan 25, 2011 at 5:43 PM, M.-A. Lemburg <m...@egenix.com> wrote:
>> I also don't see how this could save a lot of memory. As an example
>> take a French text with say 10mio code points. This would end up
>> appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
>> one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
>> on how many accents are used). That's a saving of -10MB compared to
>> today's implementation :-)
>
> If I am reading the pep right, which I may not be as I am no expert on
> unicode, the new implementation would actually give a 10MB saving
> since the wchar field is optional, so only the str (Latin-1) and utf8
> fields would need to be stored. How it decides not to store one field
> or another would need to be clarified in the pep if I am right.

The PEP actually does define that already: PyUnicode_AsUTF8 populates
the utf8 field of the existing string, while PyUnicode_AsUTF8String
creates a *new* string with that field populated. PyUnicode_AsUnicode
will populate the wstr field (but doing so generally shouldn't be
necessary).

For a UCS4 build, my reading of the PEP puts the memory savings for a
100 code point string as follows:

  Current size: 400 bytes (regardless of max code point)
  New initial size (max code point < 256): 100 bytes (75% saving)
  New initial size (max code point < 65536): 200 bytes (50% saving)
  New initial size (max code point >= 65536): 400 bytes (no saving)

Each of the "new" strings may consume additional storage if the utf8
or wstr fields get populated. The maximum possible size would be a
UCS4 string (max code point >= 65536) on a sizeof(wchar_t) == 2 system
with the utf8 string populated. In such cases, you would consume at
least 700 bytes, plus whatever additional memory is needed to encode
the non-BMP characters into UTF-8 and UTF-16.

Cheers,
Nick.

-- 
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
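
P.S. For anyone who wants to replay the arithmetic above, here is a
rough Python sketch. It is not code from the PEP: the function names
are made up for illustration, and it counts only the character data
(no object header or terminators), which is the same simplification
used in the figures above.

```python
def char_width(max_code_point):
    """Bytes per code point for the initial (canonical) representation."""
    if max_code_point < 256:
        return 1
    if max_code_point < 65536:
        return 2
    return 4


def initial_size(length, max_code_point):
    """Character data size in bytes, before utf8 or wstr are populated."""
    return length * char_width(max_code_point)


def percent_saving(length, max_code_point):
    """Saving relative to a current UCS4 build (4 bytes per code point)."""
    current = 4 * length
    return (current - initial_size(length, max_code_point)) * 100 // current


def worst_case_lower_bound(length, wchar_size=2):
    """Lower bound when the str, wstr and utf8 fields are all populated.

    Assumes at least wchar_size bytes per code point for wstr and at
    least 1 byte per code point for utf8; non-BMP characters need more.
    """
    return length * 4 + length * wchar_size + length * 1


if __name__ == "__main__":
    for max_cp in (0xFF, 0xFFFF, 0x10000):
        print(hex(max_cp), initial_size(100, max_cp), percent_saving(100, max_cp))
    print(worst_case_lower_bound(100))
```

For a 100 code point string this gives 100, 200 and 400 bytes (75%,
50% and 0% saving), and a lower bound of 700 bytes for the worst case.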