Guido van Rossum wrote:

> Note that UTF-8 would make the implementation of Python's typical
> string API painful; we currently assume (because it's true ;-) that
> random access to elements and slices (__getitem__ and __getslice__) is
> O(1). With UTF-8 these operations would be slow -- the simplest
> implementation would require counting characters from the start; one
> can speed this up with some kind of cache or tree but IMO the
> array-of-fixed-width-characters approach is much simpler. (I had a bad
> experience in my youth with strings implemented as trees, so I'm
> biased against complicated string implementations.
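
To make the cost difference in the quoted paragraph concrete, here is a rough sketch of both lookups. It is illustrative only and is not how CPython stores strings: utf8_char_at and utf32_char_at are hypothetical helper names, and UTF-32-LE is assumed as a stand-in for the "array of fixed-width characters". The UTF-8 version has to count code points from the start, while the fixed-width version computes the offset directly.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the code point at `index` by scanning UTF-8 bytes from the start.

    Hypothetical helper: O(n) in the length of the string.
    """
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    seen = -1      # index of the most recent code point start we passed
    start = None   # byte offset where the requested code point begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    # The requested code point was the last one in the buffer.
    return data[start:].decode("utf-8")


def utf32_char_at(data: bytes, index: int) -> str:
    """Return the code point at `index` from fixed-width UTF-32-LE bytes.

    Hypothetical helper: O(1), since every code point occupies 4 bytes.
    """
    if not 0 <= index < len(data) // 4:
        raise IndexError("string index out of range")
    return data[4 * index:4 * index + 4].decode("utf-32-le")
```

Both helpers return the same character for the same index; only the amount of work differs. A cache of (character index, byte offset) checkpoints would shorten the UTF-8 scan, at the price of the extra machinery the quote objects to.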

I'm still thinking that it might be a good idea to (optionally) delay
decoding of strings until you're actually doing something that needs
access to the individual characters, though.  (UTF-8 to UTF-8 shuffling
is an increasingly common use case).

(frankly, I wouldn't rule out using an "internally polymorphic"
representation for the new str type, partially motivated by my
experiences from cElementTree).

> This also explains why I'm no fan of the oft-proposed idea that slices
> should avoid making physical copies even if they make logical copies --
> the complexity of that approach horrifies me.)

that could also be an optional mechanism for advanced users, but I agree
that it needs a simple implementation.  I think some experimentation is
required here (and hope to find some time for that in a not very distant
future).

</F>

_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
