Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Antoine Pitrou Mon, 24 Oct 2005 14:22:29 -0700

> There are many design alternatives: one option would be to support
> *three* internal representations in a single type, generating the
> others from the one operation existing as needed. The default, initial
> representation might be UTF-8, with UCS-4 only being generated when
> indexing occurs, and UCS-2 only being generated when the API requires
> it. On concatenation, always concatenate just one representation: either
> one that is already present in both operands, else UTF-8.


Wouldn't it be simpler to use:
- one-byte representation if every character <= 0xFF
- two-byte representation if every character <= 0xFFFF
- four-byte representation otherwise

Then combining several strings means using the larger representation as
a result (*). In practice, most use cases will not involve the four-byte
representation.
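
To make the rule concrete, here is a minimal Python sketch, not an actual
implementation proposal. The names char_width and concat_width are
hypothetical, and the width is computed by scanning code points, which a
real implementation would do once at creation time.

```python
def char_width(s: str) -> int:
    """Bytes per character needed for s under the three-tier scheme."""
    # Highest code point in the string; 0 for the empty string.
    highest = max(map(ord, s), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def concat_width(*parts: str) -> int:
    """Width of a concatenation: the largest width among the operands."""
    return max((char_width(p) for p in parts), default=1)
```

With this sketch, concatenating an ASCII string with one containing the
euro sign (U+20AC) gives a two-byte result, and only a character above
U+FFFF forces the four-byte form.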

(*) a heuristic can be invented so that, when producing a smaller string
(by stripping/slicing/etc.), it will "sometimes" check whether a
narrower representation is possible.
For example: store the length of the string at the time of the last
check, and run a new check once the length falls below half that value.
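
A rough Python sketch of that heuristic, under the assumption that the
check is a full scan of the remaining characters. WidthTracker, shrink_to
and the checks counter are hypothetical names used only for illustration.

```python
def _width(s: str) -> int:
    """Bytes per character needed for s (1, 2 or 4)."""
    highest = max(map(ord, s), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


class WidthTracker:
    """Hypothetical holder for a string and its stored width."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.width = _width(text)      # width chosen at creation
        self.checked_len = len(text)   # length at the last check
        self.checks = 0                # number of re-checks performed

    def shrink_to(self, text: str) -> None:
        """Replace the content with a shorter string (slice, strip, ...)."""
        self.text = text
        # Re-scan only once the length drops below half the length
        # recorded at the last check.
        if len(text) * 2 < self.checked_len:
            self.width = _width(text)
            self.checked_len = len(text)
            self.checks += 1
```

Because each re-check happens only after the string has lost more than
half its length, the scans get geometrically cheaper, at the price of
sometimes keeping a wider representation than strictly necessary.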

Regards

Antoine.


