[Antoine Pitrou, replying to Thomas Wouters] > Interesting that a 20-year simple allocator (obmalloc) is able to do > better than the sophisticated TCMalloc.
It's very hard to beat obmalloc (O) at what it does. TCMalloc (T) is actually very similar where they overlap, but has to be more complex because it's trying to do more than O. In either case, for small objects "the fast path" consists merely of taking the first block of memory off a singly-linked size-segregated free list. For freeing, the fast path is just linking the block back to the front of the appropriate free list. What _could_ be faster? A "bump allocator" allocates faster (just increment a highwater mark), but creates worlds of problems when freeing. But because O is only trying to deal with small (<= 512 bytes) requests, it can use a very fast method based on trivial address arithmetic to find the size of an allocated block by just reading it up from the start of the (4K) "pool" the address belongs to. T can't do that - it appears to need to look up the address in a more elaborate radix tree, to find info recording the size of the block (which may be just about anything - no upper limit). > (well, of course, obmalloc doesn't have to worry about concurrent > scenarios, which explains some of the simplicity) Right, T has a different collection of free lists for each thread. so on each entry has to figure out which collection to use (and so doesn't need to lock). That's not free. O only has one collection, and relies on the GIL. Against that, O burns cycles worrying about something else: because it was controversial when it was new, O thought it was necessary to handle free/realloc calls even when passed addresses that had actually been obtained from the system malloc/realloc. The T docs I saw said "don't do that - things will blow up in mysterious ways". That's where O's excruciating "address_in_range()" logic comes from. While that's zippy and scales extremely well (it doesn't depend on how many objects/arenas/pools exist), it's not free, and is a significant part of the "fast path" expense for both allocation and deallocation. It also limits us to a maximum pool size of 4K (to avoid possible segfaults when reading up memory that was actually obtained from the system malloc/realloc), and that's become increasingly painful: on 64-bit boxes the bytes lost to pool headers increased, and O changed to handle requests up to 512 bytes instead of its original limit of 256. O was intended to supply "a bunch" of usable blocks per pool, not just a handful. We "should" really at least double the pool and arena sizes now. I don't think we need to cater anymore to careless code that mixes system memory calls with O calls (e.g., if an extension gets memory via `malloc()`, it's its responsibility to call `free()`), and if not then `address_in_range()` isn't really necessary anymore either, and then we could increase the pool size. O would, however, need a new way to recognize when its version of malloc punted to the system malloc. BTW, one more: last I saw T never returns memory to "the system", but O does - indeed, the parent thread here was all about _enormous_ time waste due to that in O ;-) That's not free either, but doesn't affect O's fast paths. _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com