Hi Piotr. Any chance to have a chat with you about the proposal on a more real-time communication medium like IRC or GChat? (it's #pypy on IRC and use my mail for gchat)
On Sat, Mar 5, 2016 at 2:48 AM, Piotr Jurkiewicz <piotr.jerzy.jurkiew...@gmail.com> wrote: > Hi PyPy devs, > > my name is Piotr Jurkiewicz and I am a first-year PhD student at > the AGH University of Science and Technology, Kraków, Poland. > > I am writing this email to make sure that PyPy is going to > participate in GSoC 2016, since I am interested in one of the > proposed projects: Optimized Unicode Representation > > Below is a list of my ideas and plan for the project. > > (I use Python 2 nomenclature, that is unicode strings are > `unicode` objects and bytes strings are `str` objects.) > > 1. Store all unicode objects contents internally as UTF-8. > > This would reduce size of stored contents and allow external > libraries, which expect UTF-8, to process contents directly in the > memory (for example using various regexp libraries to search unicode > string). > > 2. Unify interning caches for str and unicode. > > This would allow unicode objects and corresponding > utf8-encoded-str objects to share the same interned buffer. > > For example unicode object u'koń' would share interned buffer > with str 'ko\xc5\x84'. > > This would make unicode.encode('utf-8') basically no op. As UTF-8 > becomes dominant encoding for any data exchange, including web (86%) > [1], more and more data coming out from Python scripts needs to be > UTF-8 encoded. Therefore, it is important to make this operation as > cheap as possible. > > It would speed up str.decode('utf-8') significantly too, although it > wouldn't make it no op. String still would need to be checked if it > is a correct UTF-8 string when transforming to unicode object. But > we can get rid of additional allocation, copying string contents and > storing it twice, in CONST_STR_CACHE and CONST_UNICODE_CACHE. > > 3. Indexing of codepoints positions, what would allow O(1) random > access and slicing. > > The idea is simple: alongside contents of each interned unicode > object, store an array of unsigned integers. These integers will > be positions (in bytes), counting from the beginning of the buffer, > at which each next 64-codepoint-long 'pages' start. > > Random access would be as follows: > > page_num, byte_in_page = divmod(codepoint_pos, 64) > page_start_byte = index[page_num] > exact_byte = seek_forward(buffer[page_start_byte], byte_in_page) > return buffer[exact_byte] > > Using 64-byte long pages, like in the example above, would allow > O(1) random access, with constant terms of: > > - one cache access in cases of only-ASCII texts (indexes for such > unicode objects will not be created and maintained) > - three cache accesses in cases of texts consisting of ASCII mixed > with two-byte characters (Latin, Greek, Cyrillic, Hebrew, Arabic > alphabets) > - four or five cache accesses in cases of texts consisting mostly of > three- and four- byte characters > > (all above assuming 64-byte long CPU cache lines) > > Memory overhead associated with storing index array would be in > range 0 - 6.25%. (or 0 - 12.5% if unicode objects longer than 2^32 > codepoints will be allowed) > > (assuming that the index array consists of integers of smallest > possible type which can store buffer_bytes_len - 1) > > 4. Fast codepoints counting/seeking with branchless algorithm [2]. > > When unicode object is interned, we are sure that it is a correct > UTF-8 string. Therefore, there is no need for correctness checking > when seeking, so a branchless algorithm can be used. > > [1]: http://w3techs.com/technologies/details/en-utf8/all/all > [2]: > http://blogs.perl.org/users/nick_wellnhofer/2015/04/branchless-utf-8-length.html > > All of these changes can be introduced one at a time, what would > improve tracking of performance changes and debugging of eventual > errors. > > After completing the project I plan to write a paper describing > speedup method of random access unicode access based on indexing, as > this method has a potential for being used in other language > interpreters which have immutable and/or interned unicode strings. > Note that similar index can be created for graphemes as well, so > this method can be used in languages which provide grapheme-based > interface (like Perl 6). > > Please share your thoughts about these ideas. > > Cheers, > Piotr > _______________________________________________ > pypy-dev mailing list > pypy-dev@python.org > https://mail.python.org/mailman/listinfo/pypy-dev _______________________________________________ pypy-dev mailing list pypy-dev@python.org https://mail.python.org/mailman/listinfo/pypy-dev