[Tim]
>> I don't think we need to cater anymore to careless code that mixes
>> system memory calls with obmalloc calls (e.g., if an extension gets
>> memory via `malloc()`, it's its responsibility to call `free()`).  If
>> we stop catering to it, then `address_in_range()` isn't really
>> necessary anymore either, and then we could increase the pool size.
>> obmalloc would, however, need a new way to recognize when its version
>> of malloc punted to the system malloc.

[Thomas Wouters <tho...@python.org>]
> Is this really feasible in a world where the allocators can be selected (and
> the default changed) at runtime?

I think so.  See the "Memory Management" section of the Python/C API
Reference Manual.  It's always been "forbidden" to, e.g., allocate a
thing with PyMem_New() but release it with free().  Ditto mixing a
PyMem_Raw... allocator with a PyMem... deallocator, or PyObject...
one.  Etc.
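
For concreteness, here's a minimal illustration of the pairing rule
using the documented C-API families (NULL checks elided for brevity):

    #include <Python.h>

    void pairing_demo(void)
    {
        /* Every allocation must be released by the same family. */
        double *a = PyMem_New(double, 100);  /* PyMem family */
        PyMem_Del(a);                        /* OK; free(a) is forbidden */

        void *b = PyMem_RawMalloc(100);      /* PyMem_Raw family */
        PyMem_RawFree(b);                    /* OK; PyMem_Free(b) is not */

        void *c = PyObject_Malloc(100);      /* PyObject family (obmalloc) */
        PyObject_Free(c);                    /* OK; free(c) is not */
    }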

A type's tp_dealloc implementation should damn well know which memory
family the type's allocator used.

However, no actual proposal on the table changes any "fact on the
ground" here.  They're all as forgiving of slop as the status quo.

> And what would be an efficient way of detecting allocations punted to
> malloc, if not address_in_range?

_The_ most efficient way is the one almost all allocators used long
ago:  use some "hidden" bits right before the address returned to the
user to store info about the block being returned.  Like 1 bit to
distinguish between "obmalloc took this out of one of its pools" and
"obmalloc got this from PyMem_Raw... (whatever that maps to - obmalloc
doesn't care)".  That would be much faster than what we do now.

But on current 64-bit boxes, "1 bit" turns into "16 bytes" to maintain
alignment, so space overhead becomes 100% for the smallest objects
obmalloc can return :-(
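
To make the idea concrete, here's a minimal sketch (my names, not
obmalloc's; malloc() stands in for whichever source the block really
came from):

    #include <stdint.h>
    #include <stdlib.h>

    #define ALIGNMENT 16   /* obmalloc's alignment on 64-bit boxes */
    enum { FROM_POOL = 0, FROM_SYSTEM = 1 };

    /* Reserve ALIGNMENT bytes in front of every block, write the tag
     * there, and hand the user the address just past the header. */
    static void *alloc_with_tag(size_t n, uintptr_t tag)
    {
        unsigned char *base = malloc(ALIGNMENT + n);
        if (base == NULL)
            return NULL;
        *(uintptr_t *)base = tag;
        return base + ALIGNMENT;
    }

    /* The deallocator recovers the tag with a single read - no
     * address_in_range(), no page-size tricks. */
    static int block_source(void *p)
    {
        return (int)*(uintptr_t *)((unsigned char *)p - ALIGNMENT);
    }

A free() wrapper would just call block_source(p) and dispatch to the
pool code or the system allocator accordingly.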

Neil Schemenauer takes a different approach in the recent "radix tree
arena map for obmalloc" thread here.  We exchanged ideas on that until
it got to the point that the tree levels only need to trace out
prefixes of obmalloc arena addresses.  As a result, the new space
burden of the radix tree appears to be quite small.

It doesn't appear to be possible to make it faster than the current
address_in_range(), but in small-scale testing so far speed appears
comparable.
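
To give a feel for the shape of the lookup, here's a sketch of a
two-level radix-tree arena map.  The level widths are my assumptions
(16 MiB arenas and 48 significant address bits, so a 24-bit prefix
split 12/12), not the branch's actual constants:

    #include <stdbool.h>
    #include <stdint.h>

    #define ARENA_BITS 24                  /* 16 MiB arenas */
    #define LEVEL_BITS 12                  /* 24 prefix bits, split 12/12 */
    #define LEVEL_SIZE (1 << LEVEL_BITS)

    typedef struct {
        bool covered[LEVEL_SIZE];          /* is this arena prefix ours? */
    } leaf_t;

    static leaf_t *root[LEVEL_SIZE];       /* NULL until an arena maps here */

    /* A shift, a mask, and two indexed loads - comparable in cost to
     * the current address_in_range(), but it never reads memory that
     * obmalloc doesn't own. */
    static bool arena_map_covers(uintptr_t addr)
    {
        uintptr_t prefix = addr >> ARENA_BITS;
        leaf_t *leaf = root[(prefix >> LEVEL_BITS) & (LEVEL_SIZE - 1)];
        return leaf != NULL && leaf->covered[prefix & (LEVEL_SIZE - 1)];
    }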


> Getting rid of address_in_range sounds like a nice idea, and I would
> love to test how feasible it is -- I can run such a change against a
> wide selection of code at work, including a lot of third-party
> extension modules, but I don't see an easy way to do it right now.

Neil's branch is here:

    https://github.com/nascheme/cpython/tree/obmalloc_radix_tree

It's effectively a different _implementation_ of the current
address_in_range(), one that doesn't ever need to read possibly
uninitialized memory, and couldn't care less about the OS page size.

For the latter reason, it's by far the clearest way to enable
expanding pool size above 4 KiB.  My PR also eliminates the pool size
limitation:

    https://github.com/python/cpython/pull/13934

but at the cost of breaking bigger pools up internally into 4 KiB
regions, so the excruciating current address_in_range black magic
still works.
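
For reference, here's the spirit of that black magic, heavily
simplified (a sketch of the technique, not CPython's exact code):

    #include <stdbool.h>
    #include <stdint.h>

    #define POOL_SIZE  (4 * 1024)    /* the hard-wired 4 KiB assumption */
    #define ARENA_SIZE (256 * 1024)

    struct arena_object { uintptr_t address; };  /* 0 if slot unused */
    extern struct arena_object *arenas;
    extern unsigned int maxarenas;

    static bool address_in_range_sketch(void *p)
    {
        /* Round p down to a pool boundary and read the arena index the
         * pool header would hold there.  If p isn't from obmalloc, this
         * read may hit uninitialized memory - that's the black magic. */
        uintptr_t pool = (uintptr_t)p & ~(uintptr_t)(POOL_SIZE - 1);
        unsigned int idx = *(unsigned int *)pool;
        return idx < maxarenas &&
               (uintptr_t)p - arenas[idx].address < ARENA_SIZE &&
               arenas[idx].address != 0;
    }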

Neil and I are both keen _mostly_ to increase pool and arena sizes.
The bigger they are, the more time obmalloc can spend in its fastest
code paths.

A question we can't answer yet (and possibly never will) is how badly
that would hurt Python's ability to return arenas to the system in
long-running apps that go through phases of low and high memory need.

I don't run anything like that - not a world I've ever lived in.  All
my experiments so far say, for programs that are neither horrible nor
wonderful in this respect:

1. An arena size of 4 KiB is most effective at returning memory to the system.
2. There's significant degradation in moving even to 8 KiB arenas.
3. Which continues getting worse the larger the arenas.
4. Until reaching 128 KiB, at which point the rate of degradation falls a lot.

So the current 256 KiB arenas already suck for such programs.

For "horrible" programs, not even tiny 4K arenas help much.

For "wonderful" programs, not even 16 MiB arenas hurt arena recycling
effectiveness.

So if you have real programs keen to "return memory to the system"
periodically, it would be terrific to get info about how changing
arena size affects their behavior in that respect.

My PR uses 16 KiB pools and 1 MiB arenas, quadrupling the status quo.
Because "why not?" ;-)

Neil's branch has _generally_, but not always, used 16 MiB arenas.
The larger the arenas in his branch, the smaller the radix tree needs
to grow.