On Mon, 1 Feb 2016 at 11:11 Yury Selivanov <yselivanov...@gmail.com> wrote:
> Hi,
>
> This is the second email thread I've started about implementing an
> opcode cache in the ceval loop. Since my first post on this topic:
>
> - I've implemented another optimization (LOAD_ATTR);
>
> - I've added a detailed statistics mode so that I can "see" how the
>   cache performs and tune it;
>
> - some macro benchmarks are now 10-20% faster; 2to3 (a real
>   application) is 7-8% faster;
>
> - and I have some good insights on the memory footprint.
>
> ** The purpose of this email is to get a general approval from
> python-dev, so that I can start polishing the patches and getting
> them reviewed/committed. **
>
>
> Summary of optimizations
> ------------------------
>
> When a code object is executed more than ~1000 times, it's
> considered "hot".  Its opcodes are then analyzed to initialize
> caches for LOAD_METHOD (a new opcode I propose to add in [1]),
> LOAD_ATTR, and LOAD_GLOBAL.
>
> It's important to only optimize code objects that were executed
> "enough" times, to avoid optimizing code objects for modules,
> classes, and functions that were imported but never used.
>
> The cache struct is defined in code.h [2] and is 32 bytes long.
> When a code object becomes hot, it gets a cache offset table
> allocated for it (one byte per opcode) plus an array of cache
> structs.
>
> To measure the worst-case memory impact, I tuned my code to optimize
> *every* code object on its *first* run, then ran the entire Python
> test suite.  The test suite and the standard library together
> contain around 72,395 code objects, which required 20 MB of memory
> for caches.  The test process consumed around 400 MB of memory, so
> in the absolute worst-case scenario the overhead is about 5%.
>
> Then I ran the test suite without any modifications to the patch, so
> that only code objects that are called frequently enough were
> optimized.  In this mode, only 2,072 code objects were optimized,
> using less than 1 MB of memory for the cache.
>
>
> LOAD_ATTR
> ---------
>
> Damien George mentioned that they optimize a lot of dict lookups in
> MicroPython by caching the last key/value offset in the dict object,
> thus eliminating lots of hash lookups.  I've implemented this
> optimization in my patch, and the results are quite good: a simple
> micro-benchmark [3] shows a ~30% speed improvement.  Here are some
> debug stats generated by the 2to3 benchmark:
>
> -- Opcode cache LOAD_ATTR hits   = 14778415 (83%)
> -- Opcode cache LOAD_ATTR misses = 750 (0%)
> -- Opcode cache LOAD_ATTR opts   = 282
> -- Opcode cache LOAD_ATTR deopts = 60
> -- Opcode cache LOAD_ATTR total  = 17777912
>
> Each "hit" makes LOAD_ATTR about 30% faster.
>
>
> LOAD_GLOBAL
> -----------
>
> This turned out to be a very stable optimization.  Here is the debug
> output of the 2to3 test:
>
> -- Opcode cache LOAD_GLOBAL hits   = 3940647 (100%)
> -- Opcode cache LOAD_GLOBAL misses = 0 (0%)
> -- Opcode cache LOAD_GLOBAL opts   = 252
>
> All benchmarks (and real code) show stats like that.  Globals and
> builtins are very rarely modified, so the cache works really well.
> With the LOAD_GLOBAL opcode cache, a global lookup is very cheap:
> there is no hash lookup for it at all.  It makes optimizations like
> "def foo(len=len)" obsolete.
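The invalidation scheme behind LOAD_GLOBAL (and, via tp_version_tag,
LOAD_METHOD below) is easy to model in pure Python. Here's a minimal
sketch assuming a PEP 509-style version counter on dicts;
VersionedDict and CachedGlobalLookup are made-up names for
illustration, not code from the patch, and a complete model would
also have to bump the version in update(), pop(), etc.:

    class VersionedDict(dict):
        """A dict that bumps a version counter on every mutation.

        PEP 509 proposes a version tag like this for real dicts at
        the C level; type objects already carry tp_version_tag.
        """

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.version = 0

        def __setitem__(self, key, value):
            super().__setitem__(key, value)
            self.version += 1

        def __delitem__(self, key):
            super().__delitem__(key)
            self.version += 1

    class CachedGlobalLookup:
        """Models one LOAD_GLOBAL cache entry: remember the value and
        the dict version it was read at; redo the hash lookup only
        when the version has changed since then.
        """

        def __init__(self, namespace, name):
            self.namespace = namespace
            self.name = name
            self.cached_version = -1
            self.cached_value = None

        def load(self):
            ns = self.namespace
            if ns.version != self.cached_version:
                # Miss: fall back to a real (hashing) dict lookup.
                self.cached_value = ns[self.name]
                self.cached_version = ns.version
            # Hit: a plain integer comparison, no hash lookup at all.
            return self.cached_value

    # Usage:
    ns = VersionedDict(len=len)
    entry = CachedGlobalLookup(ns, "len")
    assert entry.load() is len   # first call populates the cache
    assert entry.load() is len   # later calls only compare versions
    ns["len"] = max              # any mutation bumps the version...
    assert entry.load() is max   # ...so the cache revalidates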
> LOAD_METHOD
> -----------
>
> This is a new opcode I propose to add in [1].  The idea is to
> substitute LOAD_ATTR with it and avoid the instantiation of
> BoundMethod objects.
>
> With the cache, we can store a reference to the method descriptor (I
> use type->tp_version_tag for cache invalidation, the same mechanism
> that _PyType_Lookup is built around).
>
> The cache makes LOAD_METHOD really efficient.  A simple
> micro-benchmark like [4] shows that with the cache and LOAD_METHOD,
> "s.startswith('abc')" becomes as efficient as "s[:3] == 'abc'".
>
> LOAD_METHOD/CALL_FUNCTION without the cache is about 20% faster than
> LOAD_ATTR/CALL_FUNCTION.  With the cache, it's about 30% faster.
>
> Here's the debug output of the 2to3 benchmark:
>
> -- Opcode cache LOAD_METHOD hits   = 5164848 (64%)
> -- Opcode cache LOAD_METHOD misses = 12 (0%)
> -- Opcode cache LOAD_METHOD opts   = 94
> -- Opcode cache LOAD_METHOD deopts = 12
> -- Opcode cache LOAD_METHOD dct-chk= 1614801
> -- Opcode cache LOAD_METHOD total  = 7945954
>
>
> What's next?
> ------------
>
> First, I'd like to merge the new LOAD_METHOD opcode, see issue 26110
> [1].  It's a very straightforward optimization, and the patch is
> small and easy to review.

+1 from me.

> Second, I'd like to merge the new opcode cache, see issue 26219 [5].
> All unittests pass.  The memory usage increase is very moderate
> (< 1 MB for the entire test suite), and the performance increase is
> significant.  The only potential blocker for this is PEP 509
> approval (which I'd be happy to assist with).

Given that it improves performance across the board and eliminates
the various tricks people use to cache globals and built-ins, a big
+1 from me. I guess that means Victor needs to ask for pronouncement
on PEP 509.

BTW, where does LOAD_ATTR fit into all of this?

> What do you think?

It all looks great to me!

-Brett

> Thanks,
> Yury
>
>
> [1] http://bugs.python.org/issue26110
> [2] https://github.com/1st1/cpython/blob/opcache5/Include/code.h#L10
> [3] https://gist.github.com/1st1/37d928f1e84813bf1c44
> [4] https://gist.github.com/1st1/10588e6e11c4d7c19445
> [5] http://bugs.python.org/issue26219
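P.S. For anyone who wants to see the startswith-vs-slicing gap Yury
mentions, a micro-benchmark in the same spirit as [4] (not the gist
itself) is only a few lines of timeit; on an interpreter without
these patches the slice form typically wins, which is exactly the gap
that LOAD_METHOD plus the cache closes:

    # Compare a method call against the equivalent slice comparison.
    # Without the patches, the method call pays for a hash-based
    # attribute lookup plus the creation of a bound-method object.
    import timeit

    setup = "s = 'abcdefghij' * 10"
    n = 10 ** 6

    t_method = timeit.timeit("s.startswith('abc')", setup=setup, number=n)
    t_slice = timeit.timeit("s[:3] == 'abc'", setup=setup, number=n)

    print("s.startswith('abc'): %.3f sec per %d calls" % (t_method, n))
    print("s[:3] == 'abc':      %.3f sec per %d calls" % (t_slice, n))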