Hi,
On 07/03/16 08:58, hubo wrote:
I think it is not reasonable to use UTF-8 to represent the unicode
string type.
1. Less storage - this is not always true. It only holds for strings
that are mostly ASCII. In Asia, most strings in local languages
(Japanese, Chinese, Korean) consist of non-ASCII characters, so they may
consume more storage in UTF-8 than in UTF-16. To make things worse,
while an N-character string always consumes 2*N bytes in UTF-16, it is
difficult to estimate the size of an N-character string in UTF-8 (it may
be anywhere from N bytes to 3*N bytes).
(UTF-16 also has two-word characters, i.e. surrogate pairs, but len()
reports 2 for these characters; I think it is not harmful to treat them
as two characters.)
Note that in PyPy unicode strings use UTF-32 as the internal
representation for all platforms, so the space saving would be larger.
Note also that currently almost all I/O operations on many platforms do
a conversion from UTF-8 to UTF-32 and back, which involves a copy and is
costly.
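To make the storage trade-off concrete, here is a small Python 3 sketch that measures the encoded size of a few sample strings. The sample strings and the helper name encoded_sizes are made up for illustration; the byte counts come from CPython's codecs and say nothing about how PyPy lays out strings internally.

```python
# Hypothetical sample strings, chosen only to illustrate the size trade-off.
samples = {
    "ascii": "hello world",
    "japanese": "こんにちは世界",
    "mixed": "PyPy は速い",
}


def encoded_sizes(text):
    """Return the code point count and the encoded byte length per encoding."""
    return {
        "chars": len(text),
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variants are used so that no byte order mark is added.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


for name, text in samples.items():
    print(name, encoded_sizes(text))
```

For the pure-ASCII sample UTF-8 is the smallest encoding, for the Japanese sample UTF-16 is smaller than UTF-8, and in both cases UTF-32 is the largest.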
2. The logic for size calculation and slicing would be very complicated.
In UTF-16, every character is represented by a 16-bit integer, so size
calculation and slicing are convenient. But a character in UTF-8 takes a
variable number of bytes, so either we call mb_* string functions
instead (which are inherently slow) or we use special logic like
storing the indices of characters in a separate array (which adds the
cost of an extra indirection).
This is true, some engineering would have to go into this part of the
representation.
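As a rough idea of what that engineering could look like, here is a minimal Python 3 sketch of the "indices in another array" approach: it records the byte offset of every fourth code point and scans forward from the nearest recorded offset. The function names and the stride of 4 are assumptions made for this example, not a description of what PyPy does or would do.

```python
def build_index(data, stride=4):
    """Record the byte offset of every `stride`-th code point in UTF-8 data."""
    offsets = []
    count = 0
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
            if count % stride == 0:
                offsets.append(pos)
            count += 1
    return offsets


def byte_offset(data, index, i, stride=4):
    """Byte offset of code point i, clamped to len(data) if i is past the end."""
    if not index:
        return 0
    checkpoint = min(i // stride, len(index) - 1)
    pos = index[checkpoint]
    remaining = i - checkpoint * stride
    while remaining > 0 and pos < len(data):
        pos += 1
        # Skip the continuation bytes of the code point we just stepped over.
        while pos < len(data) and data[pos] & 0xC0 == 0x80:
            pos += 1
        remaining -= 1
    return pos


def slice_utf8(data, index, start, stop, stride=4):
    """Slice UTF-8 bytes by code point positions (non-negative, start <= stop)."""
    return data[byte_offset(data, index, start, stride):
                byte_offset(data, index, stop, stride)]


text = "aé日本語𝄞z"
data = text.encode("utf-8")
index = build_index(data)
print(slice_utf8(data, index, 2, 5).decode("utf-8"))
```

A larger stride shrinks the index but lengthens the forward scan, so the extra memory and the lookup cost can be traded against each other.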
3. When displayed with repr(), non-ASCII characters are shown in
\uXXXX format. If the internal storage for unicode is UTF-8, the only
way to be compatible with this format is to convert it back to UTF-16.
It may be wiser to let programmers decide which encoding they would like
to use. If they want to process UTF-8 strings without the performance
cost of conversion, they should use "bytes". When correct size
calculation and slicing of non-ASCII characters matter, it may be better
to use "unicode".
I think repr is allowed to be a somewhat slow operation.
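For illustration, here is a minimal Python 3 sketch of how the escapes could be produced by walking UTF-8 bytes one code point at a time. It assumes well-formed input, does not escape backslashes or quotes, and only approximates the repr style; decode_code_points and escape_utf8 are names invented for this example.

```python
def decode_code_points(data):
    """Yield code points from well-formed UTF-8 bytes (no validation)."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            cp, n = lead, 1
        elif lead < 0xE0:
            cp, n = lead & 0x1F, 2
        elif lead < 0xF0:
            cp, n = lead & 0x0F, 3
        else:
            cp, n = lead & 0x07, 4
        # Fold in the 6 payload bits of each continuation byte.
        for cont in data[i + 1:i + n]:
            cp = (cp << 6) | (cont & 0x3F)
        yield cp
        i += n


def escape_utf8(data):
    """Build an escaped ASCII string from UTF-8 bytes (simplified)."""
    out = []
    for cp in decode_code_points(data):
        if 0x20 <= cp < 0x7F:
            out.append(chr(cp))           # printable ASCII stays as-is
        elif cp < 0x100:
            out.append("\\x%02x" % cp)
        elif cp < 0x10000:
            out.append("\\u%04x" % cp)
        else:
            out.append("\\U%08x" % cp)
    return "".join(out)


print(escape_utf8("a日".encode("utf-8")))
```

The work is linear in the length of the string, which fits the view that repr does not need to be fast.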
Cheers,
Carl Friedrich