> There are many design alternatives: one option would be to support
> *three* internal representations in a single type, generating the
> others from the one existing as needed. The default, initial
> representation might be UTF-8, with UCS-4 only being generated when
> indexing occurs, and UCS-2 only being generated when the API requires
> it. On concatenation, always concatenate just one representation: either
> one that is already present in both operands, else UTF-8.

Wouldn't it be simpler to use:
- one-byte representation if every character <= 0xFF
- two-byte representation if every character <= 0xFFFF
- four-byte representation otherwise

Then combining several strings means using the larger representation
as a result (*). In practice, most use cases will not involve the
four-byte representation.

(*) a heuristic can be invented so that, when producing a smaller
string (by stripping/slicing/etc.), it will "sometimes" check whether
a narrower representation is possible. For example: store the length
of the string when the last check occurred, and do a new check when
the length falls below half that value.

Regards

Antoine.
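
P.S. To make the idea concrete, here is a rough Python sketch of the
scheme, not an actual implementation. The names FlexStr,
narrowest_width, checked_len and slice are hypothetical and exist only
for illustration. The width is a plain number (1, 2 or 4) kept next to
an ordinary str, so no real storage is being narrowed here.

```python
def narrowest_width(text):
    """Return 1, 2 or 4: the bytes per character needed to store `text`."""
    top = max(map(ord, text), default=0)
    if top <= 0xFF:
        return 1
    if top <= 0xFFFF:
        return 2
    return 4


class FlexStr:
    """Hypothetical wrapper that records which representation is in use."""

    def __init__(self, text, width=None, checked_len=None):
        self.text = text
        # Scan the characters only when the caller does not supply a width.
        self.width = narrowest_width(text) if width is None else width
        # Length of the string when its width was last checked by a scan.
        self.checked_len = len(text) if checked_len is None else checked_len

    def __add__(self, other):
        # Combining strings: take the larger representation, without a scan.
        return FlexStr(self.text + other.text, max(self.width, other.width))

    def slice(self, start, stop):
        piece = self.text[start:stop]
        if len(piece) * 2 < self.checked_len:
            # Fell below half the length at the last check: scan again,
            # which may pick a narrower representation.
            return FlexStr(piece)
        # Otherwise keep the current width, even if it is wider than needed.
        return FlexStr(piece, self.width, self.checked_len)
```

With this sketch, slicing a ten-character two-byte string down to six
characters keeps the two-byte width without looking at the contents,
while slicing it down to three characters triggers a new scan. The
price is that a string may stay wider than necessary until it shrinks
enough to be checked again.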