On 5/2/06, Talin <[EMAIL PROTECTED]> wrote: > It appears that my question has been misunderstood by everyone; I'll try > to phrase it better: > > The short version is: will there be a mutable character array type? (which > I am calling "characters"?)
There are no plans for this AFAIK. > First, I do use array, not a lot but I do use it occasionally. One common > use case is equivalent to the Java StringBuffer class - that is, a means > for building up strings a character at a time, which otherwise would be > expensive to do with immutable strings. Better ways to do this might be [c]StringIO (in theory -- I don't know if it's fast enough in practice, but this should be easy to test) or the standard "".join(<list of strings>) approach (which underlies StringIO's implementation as well -- though not cStringIO's IIRC). > Now, from the discussion of "bytes" I get the impression that it, too, is > a mutable type (someone said 'like list'). Correct. (You have no excuse to guess -- the implementation is in the p3yk (sic) branch.) > So given that 'characters' (i.e. > unicode characters) are now distinct from 'bytes', it makes sense to me to > declare a mutable character array. And to me, the most natural name for > such a type is 'characters', although I suppose you could also call it > "stringbuffer" or something. I'm not sure that we need this. I'm not 100% sure we don't either, but I believe that the existing approach works pretty well -- it's hard to beat the "".join(<list>) approach. > BTW, is the internal encoding of unicode strings UTF-8, UTF-16, UCS-2, or > UTF-32? Just wondering... Again, please look at the implementation. It's an array of shortish ints; whether these are 2 or 3 or 4 bytes long depends on config constants. (Just kidding about the 3 bytes version. :-) I think that makes it UTF-16 or UTF-32. There is some support for surrogates when 2 bytes are used so I believe that excludes UCS-2. Note that UTF-8 would make the implementation of Python's typical string API painful; we currently assume (because it's true ;-) that random access to elements and slices (__getitem__ and __getslice__) is O(1). With UTF-8 these operations would be slow -- the simplest implementation would require counting characters from the start; one can speed this up with some kind of cache or tree but IMO the array-of-fixed-width-characters approach is much simpler. (I had a bad experience in my youth with strings implemented as trees, so I'm biased against complicated string implementations. This also explains why I'm no fan of the oft-proposed idea that slices should avoid making physical copies even if they make logical copies -- the complexity of that approach horrifies me.) -- --Guido van Rossum (home page: http://www.python.org/~guido/) _______________________________________________ Python-3000 mailing list [email protected] http://mail.python.org/mailman/listinfo/python-3000 Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com
