Jim Jewett wrote:
>>> Interning may get awkward if multiple encodings are allowed within a
>>> program, regardless of whether they're allowed for single strings. It
>>> might make sense to intern only strings that are in the same encoding
>>> as the source code. (Or whose values are limited to ASCII?)
>
>> Why? If the text hash function is defined on *code points*, then
>> interning, or really any arbitrary dictionary lookup is the same as it
>> has always been.
>
> The problem isn't the hash; it is the equality. Which encoding do you
> keep interned?

Are you using the verb "to intern" here in the sense of the intern()
builtin? If so: intern the representation of the string that gets
interned first. Python currently interns the entire string object (not
just the character data); I see no reason to change that.

>> What about never recoding? The benefit of the latin-1/ucs-2/ucs-4
>> method I previously described is that each of the encodings offer a
>> minimal representation of the code points that the text object contains.
>
> There may be some thrashing as
>
> s+= (larger char)

This creates a new string, in any case (remember, strings are not
mutable). Assume the old value A was ucs-2, and (larger char) B is
ucs-4. The code to perform the addition would be

  /* the result uses the wider of the two representations */
  C = PyString_New(A->ob_size+B->ob_size, UCS4);
  UCS2 *a_data = PyString_AsUCS2(A);
  UCS4 *b_data = PyString_AsUCS4(B);
  UCS4 *c_data = PyString_AsUCS4(C);
  /* widen each ucs-2 unit of A; its numeric value is unchanged */
  for(int k = 0; k < A->ob_size; k++)
    *c_data++ = *a_data++;
  /* B is already ucs-4; plain copy */
  for(int k = 0; k < B->ob_size; k++)
    *c_data++ = *b_data++;
  *c_data = 0;

Notice that this code is independent of whether A and B have
different representations or not.

> s[:6]

This would require two iterations over the string: one to find the
maximum character, and the second to perform the actual copying.

> The three options might well be a sensible choice, but I think it
> would already have much of the disadvantage of multiple internal
> encodings, and we might eventually regret any specific limits. (Why
> not the local 8-bit? Why not UTF-8, if that is the system encoding?)

Take a look at the above code. Here, I never invoke a codec routine
(which would be quite expensive). Instead, I rely on the fact that the
characters have the same numeric values in all three representations.

> It is easy enough to answer why not for each specific case, but I'm
> not *certain* that it is the right answer -- so why not leave it up to
> implementors if they want to do more than the basic three?

Not sure what implementors you are talking about: anybody who wants to
clone Python is free to do whatever they want. We *are* the implementors
of CPython, and if we don't want to do more, then we just don't want it.

Regards,
Martin
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
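
The widening copy for `s += (larger char)` and the two-pass slice for `s[:6]` discussed in this message can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the helper names (`pack`, `unpack`, `minimal_width`, `concat`, `take_slice`) are hypothetical and are not CPython API, each string is modelled as a byte buffer plus a width of 1, 2, or 4 bytes per character, and little-endian storage is assumed for the comparison against codecs.

```python
def pack(code_points, width):
    """Store each code point as a fixed-width little-endian integer.

    No codec is involved: the stored number is the code point itself.
    """
    return b"".join(cp.to_bytes(width, "little") for cp in code_points)


def unpack(data, width):
    """Read fixed-width little-endian integers back into code points."""
    return [int.from_bytes(data[i:i + width], "little")
            for i in range(0, len(data), width)]


def minimal_width(code_points):
    """Pick the narrowest of the three widths that fits every code point."""
    top = max(code_points, default=0)
    if top < 0x100:
        return 1   # latin-1 range
    if top < 0x10000:
        return 2   # ucs-2 range
    return 4       # ucs-4


def concat(a, a_width, b, b_width):
    """Concatenate two buffers; the result takes the wider representation."""
    width = max(a_width, b_width)
    return pack(unpack(a, a_width) + unpack(b, b_width), width), width


def take_slice(data, width, start, stop):
    """Slice in two passes: find the maximum character, then copy."""
    code_points = unpack(data, width)[start:stop]
    new_width = minimal_width(code_points)        # pass 1: largest code point
    return pack(code_points, new_width), new_width  # pass 2: actual copy


# Usage: a ucs-2 string plus a ucs-4 character, then a slice that narrows again.
a = pack([ord(ch) for ch in "caf\u0142"], 2)
b = pack([0x1F600], 4)
c, c_width = concat(a, 2, b, 4)
s, s_width = take_slice(c, c_width, 0, 3)
```

Because the stored numbers are the code points themselves, the buffers built this way coincide with what the `latin-1`, `utf-16-le` (for strings without surrogates), and `utf-32-le` codecs produce, even though no codec is called while concatenating or slicing.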