I think you're misunderstanding what we're proposing. We're proposing a UTF-8 representation that is completely hidden from the user, where everything behaves just like CPython unicode (the len() example you're showing is from a narrow unicode build, I presume?).
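(To make that concrete: this is what the user-visible behaviour looks like when the internal storage is hidden. The assertions below hold on a wide CPython build and on Python 3; on a narrow build the first one fails, which is exactly the discrepancy discussed in the quoted mails:)

    # Illustration only: one astral character, independent of whether it is
    # stored internally as UCS-4, UTF-16 or hidden UTF-8.
    s = u'\U00011409'
    assert len(s) == 1                                   # one code point
    assert s[0] == s                                     # indexing is by code point
    assert s.encode('utf-8') == b'\xf0\x91\x90\x89'      # four bytes when encoded
    assert s.encode('utf-16-le') == b'\x05\xd8\x09\xdc'  # the surrogate pair \ud805\udc09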
On Mon, Mar 7, 2016 at 11:21 AM, hubo <h...@jiedaibao.com> wrote:
> Yes, there are two-word characters in UTF-16, as I mentioned. But len() in
> CPython returns 2 for these characters (even if they are correctly processed
> in repr()):
>
> >>> len(u'\ud805\udc09')
> 2
> >>> u'\ud805\udc09'
> u'\U00011409'
>
> (Python 3.x seems to have removed the display processing.)
>
> Maybe it is better to be compatible with CPython in these situations. Since
> two-word characters are really rare in Unicode strings, programmers may not
> know they exist and may allocate exactly 2 * len(s) bytes for storing a
> unicode string. It will crash the program or create security problems if
> len() returns 1 for these characters, even if that is the correct result
> according to the Unicode standard.
>
> UTF-8 might be very useful in XML or web processing, which is quite
> important in Python programming nowadays. But I think it is more important
> to let programmers "understand" the mechanism. In C/C++, it is quite common
> to use char[] for ASCII (or ANSI) characters and wchar_t for unicode
> (actually UTF-16, or UCS-2) characters, so it may be surprising if unicode
> is actually "UTF-8" in PyPy. Web programmers who use CPython may already be
> familiar with the difference between bytes (or str in Python 2) and unicode
> (or str in Python 3), so they are unlikely to design their programs around
> PyPy-specific implementation details.
>
> 2016-03-07
> ________________________________
> hubo
> ________________________________
>
> From: Maciej Fijalkowski <fij...@gmail.com>
> Sent: 2016-03-07 16:46
> Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
> To: "hubo" <h...@jiedaibao.com>
> Cc: "Armin Rigo" <ar...@tunes.org>, "Piotr Jurkiewicz" <piotr.jerzy.jurkiew...@gmail.com>, "PyPy Developer Mailing List" <pypy-dev@python.org>
>
> Hi hubo.
>
> I think you're slightly confusing two things.
>
> UTF-16 is a variable-length encoding with two-word characters, and len()
> *has to* return 1 for those. UCS-2 seems closer to what you described (it
> is a fixed-width encoding), but it can't encode all the unicode characters
> and as such is unsuitable for a modern unicode representation.
>
> So I'll discard UCS-2 as unsuitable, and were we to use UTF-16, the slicing
> and size calculations would still have to be as complicated as for UTF-8.
>
> Complicated logic in repr() - that is not usually a performance-critical
> part of your program, and it's ok to have some complications there.
>
> It's true that UTF-16 can be less efficient than UTF-8 for certain
> languages; however, both are more memory efficient than what we currently
> use (UCS-4). There are some complications, though - even if you work
> exclusively in, say, Korean, web servers still have to deal with parts
> that are ASCII (HTML markup, CSS etc.) while handling the Korean text.
> In those cases UTF-8 vs UTF-16 is more muddled and the outcome depends a
> lot on the exact mix. We also need to consider the fact that we ship one
> canonical PyPy to everybody - people using different languages and
> different encodings.
>
> Overall, UTF-8 definitely seems like a better alternative than UCS-4 (also
> for Asian languages), which is what we are using now, and I would be
> inclined to leave UTF-16 as an option to see if it performs better on
> certain benchmarks.
>
> Best regards,
> Maciej Fijalkowski
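(As a rough illustration of the size trade-off described above, using the stdlib codecs as stand-ins for the candidate internal representations; the helper below is only a sketch, and real memory use also includes object headers and any index structures:)

    # Byte counts of the same text under the three candidate representations.
    def sizes(text):
        return (len(text.encode('utf-8')),
                len(text.encode('utf-16-le')),   # no BOM, like an internal store
                len(text.encode('utf-32-le')))   # i.e. UCS-4

    print(sizes(u'hello world'))                      # (11, 22, 44): ASCII favours UTF-8
    print(sizes(u'\uc548\ub155\ud558\uc138\uc694'))   # five Hangul syllables: (15, 10, 20)
    print(sizes(u'<p>\uc548\ub155</p>'))              # markup plus Hangul: (13, 18, 36)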
> On Mon, Mar 7, 2016 at 9:58 AM, hubo <h...@jiedaibao.com> wrote:
>> I think it is not reasonable to use UTF-8 to represent the unicode string
>> type.
>>
>> 1. Less storage - this is not always true. It is only true for strings
>> with a lot of ASCII characters. In Asia, most strings in the local
>> languages (Japanese, Chinese, Korean) are non-ASCII characters, so they
>> may consume more storage than in UTF-16. To make things worse, while a
>> string of N characters always consumes 2 * N bytes in UTF-16, it is
>> difficult to estimate the size of an N-character string in UTF-8 (it may
>> be anywhere from N to 3 * N bytes). (UTF-16 also has two-word characters,
>> but len() reports 2 for them; I think it is not harmful to treat them as
>> two characters.)
>>
>> 2. There would be very complicated logic for size calculation and
>> slicing. In UTF-16, every character is represented with a 16-bit integer,
>> so size calculation and slicing are convenient. But a character in UTF-8
>> consumes a variable number of bytes, so either we call mb_* string
>> functions instead (which are slow by nature) or we use special logic like
>> storing the indices of characters in another array (which introduces the
>> cost of extra addressing).
>>
>> 3. When displaying with repr(), non-ASCII characters are displayed in the
>> \uXXXX format. If the internal storage for unicode is UTF-8, the only way
>> to be compatible with this format is to convert it back to UTF-16.
>>
>> It may be wiser to let programmers decide which encoding they would like
>> to use. If they want to process UTF-8 strings without the cost of
>> converting, they should use "bytes". When correct size calculation and
>> slicing of non-ASCII characters matter, it may be better to use "unicode".
>>
>> 2016-03-07
>> ________________________________
>> hubo
>> ________________________________
>>
>> From: Armin Rigo <ar...@tunes.org>
>> Sent: 2016-03-05 16:09
>> Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
>> To: "Piotr Jurkiewicz" <piotr.jerzy.jurkiew...@gmail.com>
>> Cc: "PyPy Developer Mailing List" <pypy-dev@python.org>
>>
>> Hi Piotr,
>>
>> Thanks for giving some serious thought to the utf8-stored unicode string
>> proposal!
>>
>> On 5 March 2016 at 01:48, Piotr Jurkiewicz
>> <piotr.jerzy.jurkiew...@gmail.com> wrote:
>>> Random access would be as follows:
>>>
>>> page_num, byte_in_page = divmod(codepoint_pos, 64)
>>> page_start_byte = index[page_num]
>>> exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
>>> return buffer[exact_byte]
>>
>> This is the part I'm least sure about: seek_forward() needs to be a loop
>> over 0 to 63 codepoints. True, each iteration can be branchless and very
>> short - let's say 4 instructions. But that still makes a total of up to
>> 252 instructions (plus the checks to know if we must go on). These
>> instructions are all, or almost all, dependent on the previous one: you
>> must have finished computing the length of one sequence before you can
>> even begin computing the length of the next one. Maybe it's faster to use
>> a more "XMM-izable" algorithm which counts 0 for each byte in the range
>> 0x80-0xBF and 1 otherwise, and sums them up.
>>
>> There are also variants, e.g. adding a second array of words similar to
>> 'index', but where each word is 8 packed bytes giving 8 starting points
>> inside the page (each in the range 0-252). This would reduce the walk to
>> 0-7 codepoints.
>>
>> I'm +1 on your proposal. The whole thing is definitely worth a try.
>>
>>
>> A bientôt,
>>
>> Armin.
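(For readers who want to see the scheme end to end, here is a small self-contained sketch. The names PAGE_SIZE, build_page_index and codepoint_start are made up for illustration and are not PyPy's actual implementation; it counts lead bytes, i.e. everything outside 0x80-0xBF, as Armin suggests, and keeps one index entry per 64 codepoints as in Piotr's proposal. Python 3 is assumed, so iterating over bytes yields integers:)

    PAGE_SIZE = 64  # codepoints per index entry

    def is_lead_byte(b):
        # Continuation bytes are 0x80-0xBF; everything else starts a codepoint.
        return not (0x80 <= b <= 0xBF)

    def build_page_index(buf):
        # Byte offset of every PAGE_SIZE-th codepoint in the UTF-8 buffer,
        # plus the total number of codepoints (what len() would return).
        index, count = [], 0
        for pos, byte in enumerate(buf):
            if is_lead_byte(byte):
                if count % PAGE_SIZE == 0:
                    index.append(pos)
                count += 1
        return index, count

    def codepoint_start(buf, index, codepoint_pos):
        # Byte offset where codepoint number codepoint_pos starts.
        page_num, remaining = divmod(codepoint_pos, PAGE_SIZE)
        pos = index[page_num]
        while remaining:
            pos += 1
            if is_lead_byte(buf[pos]):
                remaining -= 1
        return pos

    text = 'abc\u00e9\u4e2d\U00011409xyz'   # mixes 1-, 2-, 3- and 4-byte chars
    buf = text.encode('utf-8')
    index, length = build_page_index(buf)
    assert length == len(text) == 9
    start = codepoint_start(buf, index, 4)   # fifth codepoint, U+4E2D
    end = codepoint_start(buf, index, 5)
    assert buf[start:end].decode('utf-8') == '\u4e2d'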