"Adam Olsen" <[EMAIL PROTECTED]> wrote: > Before we can decide on the internal representation of our unicode > objects, we need to decide on their external interface. My thoughts > so far:

I believe the only thing actually up for decision is what the internal
representation of a unicode object will be.  Utf-8 that is never changed?
Utf-8 that is converted to ucs-2/4 on certain kinds of accesses?
Latin-1/ucs-2/ucs-4 depending on code point content?  Always ucs-2/4,
depending on a compiler switch?

> * Most transformation and testing methods (.lower(), .islower(), etc)
> can be copied directly from 2.x.  They require no special
> implementation to perform reasonably.

A decoding variant of these would be required if the underlying
representation of a particular string is not latin-1, ucs-2, or ucs-4.
Further, any rstrip/split/etc. methods need to scan/parse the entire
string in order to discover code point starts/ends when using a utf-*
variant as an internal encoding (except for utf-32, which has a constant
width per character).

Whether or not we choose to go with a varying internal representation
(the latin-1/ucs-2/ucs-4 variant I have been suggesting),

> * Indexing and slicing is the big issue.  Do we need constant-time
> integer slicing?  .find() could be changed to return a token that
> could be used as a constant-time offset.  Incrementing the token would
> have linear costs, but that's no big deal if the offsets are always
> small.

If by "constant-time integer slicing" you mean "find the start and end
memory offsets of a slice in constant time", I would say yes.

Generally, I think tokens (in unicode strings) are a waste of time and
implementation effort.  Giving each string a fixed width per character
allows methods on those unicode strings to be far simpler in
implementation.

> * Grapheme clusters, words, lines, other groupings, do we need/want
> ways to slice based on them too?

No.

> * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> want to support them?  Now would be the time.

This would imply a tree-based string, which Guido has specifically stated
would not happen.
Never mind that it would be a beast to implement and maintain, or that it
would rule out offering the single-segment buffer interface without
reprocessing.

 - Josiah

_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
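
To make the fixed-width argument above concrete, here is a rough sketch in pure Python. It is only an illustration and not a proposed implementation: the helper names (pick_width, encode_fixed, slice_fixed, utf8_offset) are made up for this example, utf-16-le and utf-32-le stand in for ucs-2 and ucs-4, and lone surrogates are assumed not to occur. It picks the narrowest of latin-1/ucs-2/ucs-4 from the code point content, shows that slice offsets are then plain arithmetic, and contrasts that with the scan needed to find a code point in utf-8 data.

```python
def pick_width(text):
    """Bytes per code point for the narrowest fixed-width form that fits."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1      # latin-1
    if highest < 0x10000:
        return 2      # ucs-2
    return 4          # ucs-4


# Stand-in codecs (an assumption of this sketch).  utf-16-le is exactly two
# bytes per code point only because pick_width() has already ruled out
# anything above U+FFFF for the width-2 case.
_CODECS = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}


def encode_fixed(text):
    """Return (buffer, width) using the narrowest fixed-width encoding."""
    width = pick_width(text)
    return text.encode(_CODECS[width]), width


def slice_fixed(buf, width, start, stop):
    """Slice by code point index; the byte offsets are pure arithmetic."""
    return buf[start * width:stop * width]


def utf8_offset(buf, index):
    """Byte offset of code point `index` in utf-8 data; needs a linear scan."""
    seen = 0
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:   # not a continuation byte: a code point starts here
            if seen == index:
                return pos
            seen += 1
    if seen == index:             # offset one past the last code point
        return len(buf)
    raise IndexError(index)
```

For example, encode_fixed("a\u20acb") yields a 6-byte buffer with width 2, and slice_fixed(buf, 2, 1, 2) returns the two bytes of the euro sign without looking at any other part of the buffer. Finding the same code point in the utf-8 form means walking the bytes with utf8_offset().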
