[Inada Naoki <songofaca...@gmail.com>, looking into why mimalloc did so
 much better on spectral_norm]

> I compared "perf" output of mimalloc and pymalloc, and I succeeded in
> optimizing pymalloc!
>
> $ ./python bm_spectral_norm.py --compare-to ./python-master
> python-master: ..................... 199 ms +- 1 ms
> python: ..................... 182 ms +- 4 ms
>
> Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 182 ms +-
> 4 ms: 1.10x faster (-9%)
Yay!! Nice :-)

> mimalloc uses many small static (inline) functions.

Too many, if you ask me ;-)  Really, the enormous number of tiny
functions makes the code very hard to follow for me.  OTOH, the tiny
number of enormous functions in pymalloc makes that hard to follow
too :-(

> On the other hand, pymalloc_alloc and pymalloc_free are large
> functions containing the slow/rare paths.

obmalloc.c was written when compiler inlining was barely a thing.
Some restructuring is a fine idea - overdue, in fact.

> PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines
> pymalloc_free.  But the compiler doesn't know which is the hot part
> of pymalloc_alloc and pymalloc_free, so gcc failed to choose code to
> inline.  The remaining parts of pymalloc_alloc and pymalloc_free are
> called, and many push/pop instructions are executed, because they
> contain complex logic.
>
> So I tried to use LIKELY/UNLIKELY macros to teach the compiler the
> hot part, but I still need to use "static inline" for pymalloc_alloc
> and pymalloc_free [1].  The generated assembly is optimized well: the
> hot code is at the top of PyObject_Malloc [2] and PyObject_Free [3].
> But there is a lot of code duplication in PyObject_Malloc,
> PyObject_Calloc, etc...
>
> [1] https://github.com/python/cpython/pull/14674/files
> [2] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmalloc-s-L232-L274
> [3] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmalloc-s-L2-L32
>
> I will try to split pymalloc_alloc and pymalloc_free into smaller
> functions.

Yes, splitting out the slower paths should be a win.

Note this is one reason I remain keen to increase pool size:  the
bigger the pools, the more objects can be handed out & reclaimed
_from_ the fastest paths.  You've now discovered that "the fastest
paths" are, for pragmatic compiler-is-overwhelmed reasons,
significantly slower than they "should be".
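For concreteness, here is a minimal sketch of the kind of branch
hinting and hot/cold splitting being discussed, assuming GCC/Clang
builtins; the macro and function names are illustrative, not taken
from the PR:

    /* Sketch only: hint the compiler about the common case and keep
     * the rare path in a separate, non-inlined function so the hot
     * path stays small enough to inline cleanly. */
    #include <stddef.h>
    #include <stdlib.h>

    #define LIKELY(x)   __builtin_expect(!!(x), 1)
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    static void *free_list = NULL;  /* stand-in for a pool's free block list */

    /* Rare path: kept out of line so it doesn't bloat callers. */
    static void * __attribute__((noinline, cold))
    alloc_slow_path(size_t nbytes)
    {
        /* stand-in for "carve a new pool / fall back to the system" */
        return malloc(nbytes);
    }

    /* Hot path: small and inlinable; the hinted branch keeps the
     * common case at the top and jumps out only for the rare case. */
    static inline void *
    alloc_fast_path(size_t nbytes)
    {
        if (LIKELY(free_list != NULL)) {
            void *block = free_list;
            free_list = *(void **)block;  /* pop the first free block */
            return block;
        }
        return alloc_slow_path(nbytes);
    }

    int main(void)
    {
        void *p = alloc_fast_path(16);  /* first call takes the slow path */
        free(p);
        return 0;
    }

The point of the hints isn't the annotation itself but the layout: the
compiler can keep the common case contiguous at the top of the caller,
which is what the disassembly in [2] and [3] is showing for the PR.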
> Apart from the above, there is one more important difference.
>
> pymalloc returns a free pool to the freepool list as soon as the pool
> becomes empty.

Fine by me, & I think it should continue to do so.  Details below.

> On the other hand, mimalloc does not return a "page" (similar to a
> "pool" in pymalloc) every time it becomes empty [4], so it can avoid
> rebuilding the linked list of free blocks.

pymalloc never returns anything to the system at smaller than "arena"
granularity, so it's not a big deal at all _merely_ to link and unlink
a pool to/from an arena's freepool list.  There's no possibility that
any byte in the pool can change while the pool is sitting on a
freepools list.

If we freed the last pool of a given size class, and next need to
allocate another object of that size class (most common), it's cheap:
when the pool is unlinked from the pool free list, the allocator sees
that the pool's last size class is the same as the size class being
requested, and skips almost all pool initialization (the pool is
already fully prepared for objects of this size class):

    init_pool:
        /* Frontlink to used pools. */
        next = usedpools[size + size]; /* == prev */
        pool->nextpool = next;
        pool->prevpool = next;
        next->nextpool = pool;
        next->prevpool = pool;
        pool->nalloc = 1;
        if (pool->szidx == size) {  // xxxxxxxx HERE xxxxxxxx
            /* Luckily, this pool last contained blocks
             * of the same size class, so its header
             * and free list are already initialized.
             */
            bp = pool->freeblock;
            assert(bp != NULL);
            pool->freeblock = *(block **)bp;
            goto success;  // xxxxxxxxxxx FASTEST PATH xxxxxxxxxxxxxx
        }
        // and here there's a mound of code in case
        // the size classes don't match, including code
        // to restart the process of linking the pool's
        // blocks into a free list.

So in the common case, the pool's free list is reused exactly as it
was at the time the pool was linked to the freepool list.

> I think pymalloc should do the same optimization.

As above, I believe pymalloc is already optimal in this case.

On the other end, something to consider:  when a block is freed from
an entirely used pool, the pool now contains one free block, and is
linked to the front of usedpools[sizeclass].  So the next request for
a block of that size will come from that pool, which will again cause
the pool to become entirely used.

That's a kind of possible thrashing, and acts against mimalloc's
(sane, to my eyes) desire to _entirely_ use up a given pool ("page")
once it starts passing things out from it.  It could well be better
to, e.g., link the pool-with-one-newly-free-block to the end of the
usedpools list instead (see the sketch at the end of this message).
But it's unclear.  The reason it front-links is that the block being
freed is presumably in L1 cache (because, e.g., its refcount just fell
to 0), so it's best to hand it out again ASAP.  But that old reasoning
applied when pymalloc was _only_ used for PyObject-like structs.

> [4] https://github.com/microsoft/mimalloc/blob/1125271c2756ee1db1303918816fea35e08b3405/src/page.c#L365-L375
>
> BTW, which is the proper name: pymalloc, or obmalloc?

I think it's pymalloc, so that's what I used above - I got lazy
before ;-)
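To make the front-link vs. end-link alternative above concrete, here
is a minimal self-contained sketch; it is illustrative only - real
pymalloc keeps usedpools as a compact array of dummy pool headers, not
a plain sentinel struct like this:

    /* Sketch only: two policies for putting a pool that just regained
     * one free block back into the circular, doubly-linked "used
     * pools" list for its size class.  Front-linking (the current
     * behavior) makes it the first pool the next allocation draws
     * from; end-linking would defer its reuse until the other used
     * pools are exhausted. */
    #include <stdio.h>

    typedef struct pool {
        struct pool *nextpool;
        struct pool *prevpool;
        const char  *name;
    } pool;

    /* head is a sentinel; head->nextpool is where the next allocation
     * request for this size class looks first. */
    static void
    frontlink(pool *head, pool *p)          /* current behavior */
    {
        p->nextpool = head->nextpool;
        p->prevpool = head;
        head->nextpool->prevpool = p;
        head->nextpool = p;
    }

    static void
    endlink(pool *head, pool *p)            /* the alternative */
    {
        p->prevpool = head->prevpool;
        p->nextpool = head;
        head->prevpool->nextpool = p;
        head->prevpool = p;
    }

    int main(void)
    {
        pool head = { &head, &head, "head" };
        pool a = { NULL, NULL, "pool A" };
        pool b = { NULL, NULL, "pool B" };

        frontlink(&head, &a);  /* A is now first in line */
        endlink(&head, &b);    /* B waits until pools ahead of it are used up */

        printf("next allocation draws from: %s\n", head.nextpool->name);
        return 0;
    }

Whether deferring reuse that way actually wins would have to be
measured: it trades the L1-cache argument above for mimalloc-style
"use one pool up completely before touching another" behavior.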