[Thomas]
>>> And what would be an efficient way of detecting allocations punted to
>>> malloc, if not address_in_range?

[Tim]
>> _The_ most efficient way is the one almost all allocators used long
>> ago:  use some "hidden" bits right before the address returned to the
>> user to store info about the block being returned.

[Antoine]
> There's a fundamental problem here: you can't be sure that all
> allocators reserve such space.  If some allocator doesn't, it can
> return a pointer just at the very start of the page, and if obmalloc
> tries to look at "a few bits before" that address, it could very well
> page-fault.

I snipped some technical but crucial context in my reply to Thomas:
this was assuming users are following the documented requirement to
never mix memory calls from different families.

What you describe certainly could happen in "illegal" code that, e.g.,
obtained a block from the system malloc, but passed it to obmalloc to
free.  Which, in reality, works fine today, although the docs forbid
it.  (I'm not sure, but there _may_ be some mode of running today that
actually enforces the doc requirements.)

In the other world, where code is playing by the rules, if an obmalloc
malloc/realloc spelling is called, and it needs to punt to a
different allocator, no problem:  first it boosts the size request so
it has room to store "the bit" it needs before the address it actually
returns to the client.  Then it's "legal" only to free that memory
with an obmalloc spelling of free() later - obmalloc reads "the bit",
sees "oh - that's not my memory!", and adjusts the pointer accordingly
on _its_ call to the spelling of free() corresponding to the memory
family obmalloc() used to get the memory to begin with.
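
Very roughly, the shape of that scheme in C (the names and the
16-byte prefix here are invented for illustration - this isn't
obmalloc's actual code, just a sketch of the idea):

    #include <stdlib.h>

    #define PREFIX      16      /* keeps the returned address aligned */
    #define TAG_SYSTEM  0xFE    /* "this block was punted to malloc()" */

    /* Hypothetical: called when obmalloc decides to punt a request. */
    static void *
    punt_malloc(size_t nbytes)
    {
        /* Boost the request so there's room for "the bit" just before
           the address handed back to the caller. */
        unsigned char *raw = malloc(nbytes + PREFIX);
        if (raw == NULL)
            return NULL;
        raw[PREFIX - 1] = TAG_SYSTEM;
        return raw + PREFIX;
    }

    /* Hypothetical: the obmalloc spelling of free() starts by reading
       the tag byte.  A real implementation would tag the blocks
       obmalloc hands out itself too, so the byte before any legally
       obtained address is meaningful. */
    static void
    punt_free(void *p)
    {
        unsigned char *user = p;
        if (user == NULL)
            return;
        if (user[-1] == TAG_SYSTEM) {
            /* "oh - that's not my memory!" - undo the offset and hand
               it back to the allocator that really owns it. */
            free(user - PREFIX);
        }
        /* else: an obmalloc-owned block, handled by obmalloc proper. */
    }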

>> Neil Schemenauer takes a different approach in the recent "radix tree
>> arena map for obmalloc" thread here. ...

> One open question is the cache efficiency of the two approaches.
> Intuitively, address_in_range() will often look at exactly the same
> cache line (since a significant number of allocations will share the
> same "page prefix").

I believe we haven't seen a program yet that used more than one node
at the tree's top level :-)  But who knows?  mmap() and VirtualAlloc()
don't appear to make any _guarantees_ that the high-order bits of
returned addresses aren't entirely random.  In real life so far, they
always appear to be zeroes.

While x86-64 pointers are 64 bits wide, current hardware "only"
implements 48 bits of virtual address space, and I haven't seen a
virtual address yet where any of the top 16 bits are set.

AMD requires that the top 16 bits of a virtual address be copies of
bit 47 (the 2**47 bit).
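
In other words, addresses are "canonical":  bits 63 through 47 must
all be equal.  A throwaway check - nothing from CPython, just to make
the rule concrete:

    #include <stdbool.h>
    #include <stdint.h>

    /* True iff bits 63..47 of the pointer are all zeros or all ones,
       i.e. the top 16 bits are copies of bit 47. */
    static bool
    is_canonical(const void *p)
    {
        uintptr_t top = (uintptr_t)p >> 47;   /* the 17 bits in question */
        return top == 0 || top == 0x1FFFF;
    }

Every user-space address we've actually seen falls in the "all zeros"
half.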

Blah blah blah - for the foreseeable future, the top level of the tree
has a very easy job.

And Neil keenly observed that the top level of the tree can be
_declared_  as being very broad (and so suck up a lot of the leading
bits), because it's a file static and is effectively just an address
space reservation (at least on Linux) until nodes in it actually get
used.
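
The shape of that, with invented constants and names (not the code in
Neil's branch - just a sketch of why a broad static root is cheap):

    #include <stddef.h>
    #include <stdint.h>

    #define TOP_BITS  16                  /* leading bits eaten by the root */
    #define TOP_SIZE  ((size_t)1 << TOP_BITS)

    typedef struct mid_node mid_node;     /* lower levels, allocated lazily */

    /* File static, zero-initialized:  ~512 KiB of pointers on a 64-bit
       box, but untouched - hence, on Linux, not resident - until a
       program actually uses addresses with those leading bits. */
    static mid_node *arena_map_root[TOP_SIZE];

    static mid_node **
    root_slot(const void *p)
    {
        uintptr_t u = (uintptr_t)p;
        /* Index by bits 47..32 of the address; the mask also discards
           any sign-extension bits above bit 47. */
        return &arena_map_root[(u >> (48 - TOP_BITS)) & (TOP_SIZE - 1)];
    }

So far we haven't seen a program that populates more than one slot of
that root - which is the "very easy job" above.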

>  Apparently, this benefit may be offset by cache
> aliasing issues.  Cache aliasing can also be mitigated by the fact that
> CPU caches are most of the time N-way with N > 1 (but N generally
> remains small, from 2 to 8, for L1 and L2 caches).
>
> So I guess the a priori answer is "it's complicated" :-)

Indeed it is.

> I must also thank both you and Neil for running these experiments, even
> though sometimes I may disagree about the conclusions :-)

Well, there aren't any conclusions yet - just seeing the same things repeatedly.

Over the weekend, Neil ran many variations of a longish-running "real
job" related to his work, which goes through phases of processing bulk
database operations and "trying to" release the space (details are
complicated).

Arena recycling was essentially non-existent in either branch (my PR
or the radix tree).

In 3.7.3 it _appeared_ to recycle hundreds of thousands of arenas, but
on closer examination they were virtually all of the "useless arena
thrashing" flavor.  The number of arenas in use was almost always
within one of the high-water mark.

But it got much better in the branches if arenas were shrunk to a tiny 8 KiB.

Which is just another instance of the "256 KiB arenas are already way
too fat for effective arena recycling unless the program is
exceptionally friendly in its dynamic memory use patterns"
observation.

We haven't seen a case yet where 1 MiB arenas do materially worse than
256 KiB ones in this respect.

Speed is generally a wash between the branches, although they
consistently appear to be faster (by a little, not a lot) than 3.7.3.

The radix tree generally appears to be a little more memory-frugal
than my PR (presumably because my PR needs to break "big pools" into
4 KiB chunks while the tree branch doesn't, which buys the tree more
space to actually store objects than the new tree itself costs).

We need more "real jobs", and especially ones where arena recycling is
actually doing good in 3.7.3 (which can be hard to tell, because the
"number of arenas recycled" output in 3.7.3 is vastly inflated by
arena-thrashing noise).