I think you're misunderstanding what we're proposing. We're proposing a UTF-8 representation that is completely hidden from the user, where everything behaves just like CPython unicode (the len() example you're showing is from a narrow unicode build, I presume?).
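(To make that concrete: this is what the user-visible behaviour looks like when the internal storage is hidden. The assertions below hold on a wide CPython build and on Python 3; on a narrow build the first one fails, which is exactly the discrepancy discussed in the quoted mails:)

    # Illustration only: one astral character, independent of whether it is
    # stored internally as UCS-4, UTF-16 or hidden UTF-8.
    s = u'\U00011409'
    assert len(s) == 1                                   # one code point
    assert s[0] == s                                     # indexing is by code point
    assert s.encode('utf-8') == b'\xf0\x91\x90\x89'      # four bytes when encoded
    assert s.encode('utf-16-le') == b'\x05\xd8\x09\xdc'  # the surrogate pair \ud805\udc09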
On Mon, Mar 7, 2016 at 11:21 AM, hubo <h...@jiedaibao.com> wrote:
> Yes, there are two-word characters in UTF-16, as I mentioned. But len() in
> CPython returns 2 for these characters (even if they are correctly processed
> in repr()):
>
> >>> len(u'\ud805\udc09')
> 2
> >>> u'\ud805\udc09'
> u'\U00011409'
>
> (Python 3.x seems to have removed the display processing.)
>
> Maybe it is better to be compatible with CPython in these situations. Since
> two-word characters are really rare in Unicode strings, programmers may not
> know they exist and may allocate exactly 2 * len(s) bytes for storing a
> unicode string. It will crash the program or create security problems if
> len() returns 1 for these characters, even if that is the correct result
> according to the Unicode standard.
>
> UTF-8 might be very useful in XML or web processing, which is quite
> important in Python programming nowadays. But I think it is more important
> to let programmers "understand" the mechanism. In C/C++, it is quite common
> to use char[] for ASCII (or ANSI) characters and wchar_t for unicode
> (actually UTF-16, or UCS-2) characters, so it may be surprising if unicode
> is actually "UTF-8" in PyPy. Web programmers who use CPython may already be
> familiar with the difference between bytes (or str in Python 2) and unicode
> (or str in Python 3), so they are unlikely to design their programs around
> PyPy-specific implementation details.
>
> 2016-03-07
> ________________________________
> hubo
> ________________________________
>
> From: Maciej Fijalkowski <fij...@gmail.com>
> Sent: 2016-03-07 16:46
> Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
> To: "hubo" <h...@jiedaibao.com>
> Cc: "Armin Rigo" <ar...@tunes.org>, "Piotr Jurkiewicz" <piotr.jerzy.jurkiew...@gmail.com>, "PyPy Developer Mailing List" <pypy-dev@python.org>
>
> Hi hubo.
>
> I think you're slightly confusing two things.
>
> UTF-16 is a variable-length encoding with two-word characters, and len()
> *has to* return 1 for those. UCS-2 seems closer to what you described (it
> is a fixed-width encoding), but it can't encode all the unicode characters
> and as such is unsuitable for a modern unicode representation.
>
> So I'll discard UCS-2 as unsuitable, and were we to use UTF-16, the slicing
> and size calculations would still have to be as complicated as for UTF-8.
>
> Complicated logic in repr() - that is not usually a performance-critical
> part of your program, and it's ok to have some complications there.
>
> It's true that UTF-16 can be less efficient than UTF-8 for certain
> languages; however, both are more memory efficient than what we currently
> use (UCS-4). There are some complications, though - even if you work
> exclusively in, say, Korean, web servers still have to deal with parts
> that are ASCII (HTML markup, CSS etc.) while handling the Korean text.
> In those cases UTF-8 vs UTF-16 is more muddled and the outcome depends a
> lot on the exact mix. We also need to consider the fact that we ship one
> canonical PyPy to everybody - people using different languages and
> different encodings.
>
> Overall, UTF-8 definitely seems like a better alternative than UCS-4 (also
> for Asian languages), which is what we are using now, and I would be
> inclined to leave UTF-16 as an option to see if it performs better on
> certain benchmarks.
>
> Best regards,
> Maciej Fijalkowski
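(As a rough illustration of the size trade-off described above, using the stdlib codecs as stand-ins for the candidate internal representations; the helper below is only a sketch, and real memory use also includes object headers and any index structures:)

    # Byte counts of the same text under the three candidate representations.
    def sizes(text):
        return (len(text.encode('utf-8')),
                len(text.encode('utf-16-le')),   # no BOM, like an internal store
                len(text.encode('utf-32-le')))   # i.e. UCS-4

    print(sizes(u'hello world'))                      # (11, 22, 44): ASCII favours UTF-8
    print(sizes(u'\uc548\ub155\ud558\uc138\uc694'))   # five Hangul syllables: (15, 10, 20)
    print(sizes(u'<p>\uc548\ub155</p>'))              # markup plus Hangul: (13, 18, 36)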
> On Mon, Mar 7, 2016 at 9:58 AM, hubo <h...@jiedaibao.com> wrote:
>> I think it is not reasonable to use UTF-8 to represent the unicode string
>> type.
>>
>> 1. Less storage - this is not always true. It is only true for strings
>> with a lot of ASCII characters. In Asia, most strings in the local
>> languages (Japanese, Chinese, Korean) are non-ASCII characters, so they
>> may consume more storage than in UTF-16. To make things worse, while a
>> string of N characters always consumes 2 * N bytes in UTF-16, it is
>> difficult to estimate the size of an N-character string in UTF-8 (it may
>> be anywhere from N to 3 * N bytes). (UTF-16 also has two-word characters,
>> but len() reports 2 for them; I think it is not harmful to treat them as
>> two characters.)
>>
>> 2. There would be very complicated logic for size calculation and
>> slicing. In UTF-16, every character is represented with a 16-bit integer,
>> so size calculation and slicing are convenient. But a character in UTF-8
>> consumes a variable number of bytes, so either we call mb_* string
>> functions instead (which are slow by nature) or we use special logic like
>> storing the indices of characters in another array (which introduces the
>> cost of extra addressing).
>>
>> 3. When displaying with repr(), non-ASCII characters are displayed in the
>> \uXXXX format. If the internal storage for unicode is UTF-8, the only way
>> to be compatible with this format is to convert it back to UTF-16.
>>
>> It may be wiser to let programmers decide which encoding they would like
>> to use. If they want to process UTF-8 strings without the cost of
>> converting, they should use "bytes". When correct size calculation and
>> slicing of non-ASCII characters matter, it may be better to use "unicode".
>>
>> 2016-03-07
>> ________________________________
>> hubo
>> ________________________________
>>
>> From: Armin Rigo <ar...@tunes.org>
>> Sent: 2016-03-05 16:09
>> Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
>> To: "Piotr Jurkiewicz" <piotr.jerzy.jurkiew...@gmail.com>
>> Cc: "PyPy Developer Mailing List" <pypy-dev@python.org>
>>
>> Hi Piotr,
>>
>> Thanks for giving some serious thought to the utf8-stored unicode string
>> proposal!
>>
>> On 5 March 2016 at 01:48, Piotr Jurkiewicz
>> <piotr.jerzy.jurkiew...@gmail.com> wrote:
>>> Random access would be as follows:
>>>
>>> page_num, byte_in_page = divmod(codepoint_pos, 64)
>>> page_start_byte = index[page_num]
>>> exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
>>> return buffer[exact_byte]
>>
>> This is the part I'm least sure about: seek_forward() needs to be a loop
>> over 0 to 63 codepoints. True, each iteration can be branchless and very
>> short - let's say 4 instructions. But that still makes a total of up to
>> 252 instructions (plus the checks to know if we must go on). These
>> instructions are all, or almost all, dependent on the previous one: you
>> must have finished computing the length of one sequence before you can
>> even begin computing the length of the next one. Maybe it's faster to use
>> a more "XMM-izable" algorithm which counts 0 for each byte in the range
>> 0x80-0xBF and 1 otherwise, and sums them up.
>>
>> There are also variants, e.g. adding a second array of words similar to
>> 'index', but where each word is 8 packed bytes giving 8 starting points
>> inside the page (each in the range 0-252). This would reduce the walk to
>> 0-7 codepoints.
>>
>> I'm +1 on your proposal. The whole thing is definitely worth a try.
>>
>>
>> A bientôt,
>>
>> Armin.
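(For readers who want to see the scheme end to end, here is a small self-contained sketch. The names PAGE_SIZE, build_page_index and codepoint_start are made up for illustration and are not PyPy's actual implementation; it counts lead bytes, i.e. everything outside 0x80-0xBF, as Armin suggests, and keeps one index entry per 64 codepoints as in Piotr's proposal. Python 3 is assumed, so iterating over bytes yields integers:)

    PAGE_SIZE = 64  # codepoints per index entry

    def is_lead_byte(b):
        # Continuation bytes are 0x80-0xBF; everything else starts a codepoint.
        return not (0x80 <= b <= 0xBF)

    def build_page_index(buf):
        # Byte offset of every PAGE_SIZE-th codepoint in the UTF-8 buffer,
        # plus the total number of codepoints (what len() would return).
        index, count = [], 0
        for pos, byte in enumerate(buf):
            if is_lead_byte(byte):
                if count % PAGE_SIZE == 0:
                    index.append(pos)
                count += 1
        return index, count

    def codepoint_start(buf, index, codepoint_pos):
        # Byte offset where codepoint number codepoint_pos starts.
        page_num, remaining = divmod(codepoint_pos, PAGE_SIZE)
        pos = index[page_num]
        while remaining:
            pos += 1
            if is_lead_byte(buf[pos]):
                remaining -= 1
        return pos

    text = 'abc\u00e9\u4e2d\U00011409xyz'   # mixes 1-, 2-, 3- and 4-byte chars
    buf = text.encode('utf-8')
    index, length = build_page_index(buf)
    assert length == len(text) == 9
    start = codepoint_start(buf, index, 4)   # fifth codepoint, U+4E2D
    end = codepoint_start(buf, index, 5)
    assert buf[start:end].decode('utf-8') == '\u4e2d'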