Nick Coghlan wrote:
> That way the internal representation of a string would only need to grow
> one extra field (the one saying how many bytes there are per character),
> and the internal state would remain immutable.

You could play tricks with ob_size to save this field:
- ob_size < 0: 8-bit data; length is abs(ob_size)
- ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2
- ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2
  (integer division, which drops the low bit)

The first representation constrains the length of an 8-bit string to
max_ssize_t, which is also the limit today. For 16-bit strings, the limit
is max_ssize_t/2 characters, which means max_ssize_t bytes; this is
technically more constraining, but such a string would still consume half
of the address space, and is unlikely to get created (*). For 32-bit
strings, the limit is also max_ssize_t/2 characters, yet the maximum string
would require about 2*max_ssize_t (==max_size_t) bytes, so this isn't a
real limitation.

> For 8-bit source data, 'latin-1' would then be the most efficient
> encoding, in that it would be a simple memcpy from the bytes object's
> internal buffer to the string object's internal buffer. Other encodings
> like 'koi8-r' would be decoded to either latin-1, UCS-2 or UCS-4
> depending on the largest code point in the source data.

This might slow down codecs somewhat: they would have to scan the input
string first to find out what the maximum code point is, whereas they
currently can decode in a single pass. Of course, for multi-byte codecs,
such scanning is a good idea anyway (some currently overallocate just to
avoid the second pass).

Regards,
Martin

(*) Many systems don't allow such large memory blocks anyway. E.g. on
32-bit Windows, in the standard configuration, the address space is
"only" 2GB.
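
For illustration, the ob_size trick and the codec pre-scan discussed above can be sketched in Python. This is only a rough model, not CPython code: the names pack_size, unpack_size and narrowest_width are hypothetical, and storing every empty string as 0 is an assumption, since the scheme as written leaves ob_size == 0 unspecified.

```python
def pack_size(width, length):
    """Fold (bytes per character, length) into one signed integer."""
    if length < 0:
        raise ValueError("length must be non-negative")
    if length == 0:
        return 0                  # assumption: empty strings are always stored as 0
    if width == 1:
        return -length            # negative: 8-bit data
    if width == 2:
        return 2 * length         # positive and even: 16-bit data
    if width == 4:
        return 2 * length + 1     # positive and odd: 32-bit data
    raise ValueError("width must be 1, 2 or 4")


def unpack_size(ob_size):
    """Recover (bytes per character, length) from the packed value."""
    if ob_size <= 0:
        return 1, -ob_size        # 8-bit data (0 is the assumed empty case)
    if (ob_size & 1) == 0:
        return 2, ob_size // 2    # 16-bit data
    return 4, ob_size // 2        # 32-bit data; floor division drops the tag bit


def narrowest_width(text):
    """Pre-scan for the largest code point and pick the narrowest width."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1                  # every character fits in latin-1
    if top < 0x10000:
        return 2                  # every character fits in UCS-2
    return 4                      # at least one character needs UCS-4
```

A codec following the quoted proposal would call something like narrowest_width on the decoded code points first and only then allocate the buffer, which is the extra pass mentioned above.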