"Jim Jewett" <[EMAIL PROTECTED]> wrote:
> Interning may get awkward if multiple encodings are allowed within a
> program, regardless of whether they're allowed for single strings.  It
> might make sense to intern only strings that are in the same encoding
> as the source code.  (Or whose values are limited to ASCII?)

Why?  If the text hash function is defined on *code points*, then
interning, or really any arbitrary dictionary lookup, is the same as it
has always been.

> There should be only one reference to a string until is constructed,
> and after that, its data should be immutable.  Recoding that results
> in different bytes should not be in-place.  Either it returns a new
> string (no problem) or it doesn't change the databuffer-and-encoding
> pointer until the new databuffer is fully constructed.

What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
method I previously described is that each of the encodings offers a
minimal representation of the code points that the text object contains.
Certain operations would require a bit of work to compare code points
stored in an x-bit-wide representation with code points stored in a
y-bit-wide representation.

> So adding boilerplate to treat text as bytes "for efficiency" may
> become a standard recipe?  Not so good.

Presumably there is going to be a mechanism to open files as bytes
(reads return bytes), and for things like web servers, file servers,
etc., serving the content up as just a bunch of bytes is really the only
thing that makes sense.

 - Josiah

_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
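
For illustration only, here is a rough Python sketch of the two ideas
above: storing code points at the narrowest of three widths, and hashing
and comparing on code points so that dictionary lookup and interning do
not care which width was used.  The names `Text`, `intern_text` and
`_WIDTHS` are made up for this sketch and are not part of any proposal;
the `typecode` argument exists only to force a wider buffer for the
demonstration.

```python
from array import array

# Hypothetical names throughout; this is a sketch, not a proposed API.
# (typecode, largest code point that fits): roughly latin-1, ucs-2, ucs-4.
# "I" is 4 bytes on common platforms, which is assumed here.
_WIDTHS = (("B", 0xFF), ("H", 0xFFFF), ("I", 0x10FFFF))


class Text:
    """Immutable text stored as fixed-width code points."""

    def __init__(self, s, typecode=None):
        points = [ord(c) for c in s]
        if typecode is None:
            # Pick the narrowest width that holds every code point.
            top = max(points, default=0)
            for typecode, limit in _WIDTHS:
                if top <= limit:
                    break
        # A forced typecode that is too narrow raises OverflowError.
        self.units = array(typecode, points)

    def __hash__(self):
        # Hash the code point values, not the underlying bytes.
        return hash(tuple(self.units))

    def __eq__(self, other):
        if not isinstance(other, Text):
            return NotImplemented
        # Compare code point by code point, whatever the widths are.
        return tuple(self.units) == tuple(other.units)


_interned = {}


def intern_text(t):
    """Return the canonical Text object for these code points."""
    return _interned.setdefault(t, t)


narrow = Text("abc")               # stored one byte per code point
wide = Text("abc", typecode="I")   # same code points, wider buffer
print(narrow == wide, intern_text(narrow) is intern_text(wide))
```

The cross-width comparison here simply converts both buffers to tuples,
which is the "bit of work" in its laziest form; a real implementation
would presumably compare element by element without copying.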
