[Neil Schemenauer <nas-pyt...@arctrix.com>]
> ...
> BTW, the current radix tree doesn't even require that pools are
> aligned to POOL_SIZE.  We probably want to keep pools aligned
> because other parts of obmalloc rely on that.

obmalloc relies on it heavily.  Another radix tree could map block
addresses to all the necessary info in a current pool header, but that
could become gigantic.  For example, even a current "teensy" 256 KiB
arena can be carved into the order of 16K 16-byte blocks (the smallest
size class).  And finding a pool header now is unbeatably cheap:  just
clear the last 12 address bits, and you're done.


> Here is the matchup of the radix tree vs the current
> address_in_range() approach.
>
> - nearly the same in terms of performance.  It might depend on OS
>   and workload but based on my testing on Linux, they are very
>   close.  Would be good to do more testing but I think the radix
>   tree is not going to be faster, only slower.

People should understand that the only point to these things is to
determine whether a pointer passed to a free() or realloc() spelling
was obtained from an obmalloc pool or from the system malloc family.
So they're invoked once near the very starts of those two functions,
and that's all.

Both ways are much more expensive than finding a pool header (which is
just clearing some trailing address bits).

The current way reads an "arena index" out of the pool header, uses
that to index the file static `arenas` vector to get a pointer to an
arena descriptor, then reads the arena base address out of the
descriptor.  That;s used to determine whether the original address is
contained in the arena.  The primary change my PR makes is to read the
arena index from the start of the _page_ the address belongs instead
(the pool header address is irrelevant to this, apart from that a pool
header is aligned to the first page in a pool).

The radix tree picks bits out of the address three times to index into
a 3-level (but potentially broad) tree, ending with a node containing
info about the only two possible arenas the original address may
belong to.  Then that info is used to check.

The number of loads is essentially the same, but the multiple levels
of indexing in the tree is a teensy bit more expensive because it
requires more bit-fiddling.  I spent hours, in all, dreaming up a way
to make the _seemingly_ more complex final "so is the address in one
of those two arenas or not?" check about as cheap as the current way.
But Neil didn't see any significant timing difference after making
that change, which was mildly disappointing but not really surprising:
 arithmetic is just plain cheap compared to reading up memory.

> - radix tree uses a bit more memory overhead.  Maybe 1 or 2 MiB on a
>   64-bit OS.  The radix tree uses more as memory use goes up but it
>   is a small fraction of total used memory.  The extra memory use is
>   the main downside at this point, I think.

I'd call it the only downside.  But nobody yet has quantified how bad
it can get.


> - the radix tree doesn't read uninitialized memory.  The current
>   address_in_range() approach has worked very well but is relying on
>   some assumptions about the OS (how it maps pages into the program
>   address space).  This is the only aspect where the radix tree is
>   clearly better.  I'm not sure this matters enough to offset the
>   extra memory use.

I'm not worried about that.  The only real assumption here is that if
an OS supports some notion of "pages" at all, then for any address for
which the program has read/write access (which are the only kinds of
addresses that can be sanely passed to free/realloc), the OS allows
the same access to the entire page containing that address.

In two decades we haven't seen an exception to that yet, right?  It's
hard to imagine a HW designer thinking "I know!  Let's piss away more
transistors on finer-grained control nobody has asked for, and slow
down memory operations even more checking for that." ;-)


> - IMHO, the radix tree code is a bit simpler than Tim's
>   obmalloc-big-pool code.

Absolutely so.  There's another way to look at this:  if Vladimir
Marangozov (obmalloc's original author) had used an arena radix tree
from the start, would someone now get anywhere proposing a patch to
change it to the current scheme?  I'd expect a blizzard of -1 votes,
starting with mine ;-)

> ...
> My feeling right now is that Tim's obmalloc-big-pool is the best
> design at this point.  Using 8 KB or 16 KB pools seems to be better
> than 4 KB.  The extra complexity added by Tim's change is not so
> nice.  obmalloc is already extremely subtle and obmalloc-big-pool
> makes it more so.

Moving to bigger pools and bigger arenas are pretty much no-brainers
for us, but unless pool size is increased there's no particular reason
to pursue either approach - "ain't broke, don't fix".

Larry Hastings started a "The untuned tunable parameter ARENA_SIZE"
thread here about two years ago, where he got a blizzard of flak for
merely suggesting that perhaps, after 20 years, it may be time to
increase the 256 KiB arena size ;-)

Ah, BTW, there's another real advantage to the radix tree, although
it's minor:  less "quantization" loss in bigger pools.  For example,
we can only get 7 objects of the largest (512 bytes) size class now.
Move my PR to 16KiB pools and the pool gets 4 times bigger, but we can
still only get 7*4 = 28 objects out of that pool.  It's still
"stealing" bytes out of every page.

But the radix tree approach doesn't:  only the pool header takes away
from block space.  So move yours to 16 KiB pools, and we could get 31
512-byte blocks out of  the pool (3 more).
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/LQK4TMRZW4SOFJAPUXIB2AIQSOV5ZKYC/

Reply via email to