[Inada Naoki <songofaca...@gmail.com>, trying mimalloc]
>>> Hmm, it is not good.  mimalloc uses MADV_FREE, so it may affect some
>>> benchmarks.  I will look at it later.

>> ...
>> $ ./python  -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
>> Slower (60):
>> - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x slower (+158%)
>> - logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x slower (+127%)

> I think I understand why mimalloc uses more than twice the memory of
> pymalloc + glibc malloc in the logging_format and logging_simple benchmarks.
>
> These two benchmarks do something like this:
>
>     buf = [] # in StringIO
>     for _ in range(10*1024):
>         buf.append("important: some important information to be logged")
>     s = "".join(buf)  # StringIO.getvalue()
>     s.splitlines()
>
> mimalloc uses a size-segregated allocator for requests up to ~512 KiB, and
> the size class is determined by the top three bits.
> On the other hand, list increases its capacity by a factor of about 9/8,
> which means the next size class is used on each realloc.

Often, but not always (multiplication by 9/8 may not change the top 3
bits - e.g., 128 * 9/8 = 144).
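
To make the bucketing concrete, here's a toy sketch of one reading of
"size class is determined by the top three bits" (my reading, not
mimalloc's actual code):

    def size_class(n):
        # keep the three most significant bits of n, zero the rest
        msb = n.bit_length() - 1        # index of the highest set bit
        shift = max(msb - 2, 0)
        return (n >> shift) << shift

    print([size_class(n) for n in (128, 144, 160, 180)])
    # -> [128, 128, 160, 160]

Under that reading, 128 and 144 do share a class, which is the point
of the parenthetical example above.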

>  In the end, all size classes have 1~3 used/cached memory blocks.

No doubt part of it, but hard to believe it's most of it.  If the loop
count above really is 10240, then there's only about 80K worth of
pointers in the final `buf`.  To account for a difference of over 10M,
it would need to have left behind well over 100 _full_ size copies
from earlier reallocs.

In fact, the list's allocated capacity (in elements) across resizes
goes like so:

0, 4, 8, 16, 25, 35, 46, ..., 7671, 8637, 9723, 10945

Adding all of those up gives 96,113, so it accounts for less than 1M
worth of 8-byte pointers even if none were ever released.  mimalloc
will, of course, add its own slop on top of that - but not a factor
of ten's worth.  Unless maybe it's using a dozen such buffers at
once?
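
That sequence and sum can be reproduced from 3.7's over-allocation
rule in Objects/listobject.c (list_resize):

    # new_allocated = newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6)
    caps = [0]
    allocated = 0
    for newsize in range(1, 10 * 1024 + 1):
        if newsize > allocated:         # append would overflow: resize
            allocated = newsize + (newsize >> 3) + (3 if newsize < 9 else 6)
            caps.append(allocated)
    print(caps[:7], caps[-4:])  # [0, 4, 8, 16, 25, 35, 46] [7671, 8637, 9723, 10945]
    print(sum(caps))            # 96113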

But does it really matter? ;-)  mimalloc "should have" done MADV_FREE
on the pages holding the older `buf` instances, so it's not like the
app is demanding to hold on to the RAM (albeit that it may well show
up in the app's RSS unless/until the OS takes the RAM away).
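
For concreteness, MADV_FREE just tells the kernel it may reclaim the
pages whenever it likes; until it actually does, they still count
against RSS.  A Linux-specific sketch via ctypes (the mmap module
didn't grow a madvise() method until 3.8):

    import ctypes
    import mmap

    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    MADV_FREE = 8                       # Linux >= 4.5 value

    size = 1 << 20
    buf = mmap.mmap(-1, size)           # anonymous 1 MiB mapping
    buf.write(b"x" * size)              # touch the pages: now resident

    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    if libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(size), MADV_FREE):
        raise OSError(ctypes.get_errno(), "madvise failed")
    # RSS typically still shows the 1 MiB here, until memory pressure.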


> This is almost the worst case for mimalloc.  In a more complex application,
> there may be more chances to reuse memory blocks.
>
> In a complex or huge application, this overhead will become relatively
> small.  Its speed is attractive.
>
> But for memory efficiency, pymalloc + jemalloc / tcmalloc may be better for
> common cases.

The mimalloc page says that, in their benchmarks:

"""
In our benchmarks (see below), mimalloc always outperforms all other
leading allocators (jemalloc, tcmalloc, Hoard, etc), and usually uses
less memory (up to 25% more in the worst case).
"""

obmalloc is certainly more "memory efficient" (for some meanings of
that phrase) for smaller objects:  in 3.7 it splits objects of <= 512
bytes into 64 size classes.  mimalloc also has (close to) 64 "not
gigantic" size classes, but those cover a range of sizes over a
thousand times wider (up to about half a meg).  Everything obmalloc
handles fits in mimalloc's first 20 size classes.  So mimalloc
routinely needs more memory to satisfy a "small object" request than
obmalloc does.
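
To spell that out, obmalloc's class mapping in 3.7 is just "round up
to a multiple of 8" (mirroring the (nbytes - 1) >> ALIGNMENT_SHIFT
computation in Objects/obmalloc.c):

    ALIGNMENT_SHIFT = 3                 # classes are multiples of 8 bytes

    def size_class_index(nbytes):       # valid for 1 <= nbytes <= 512
        return (nbytes - 1) >> ALIGNMENT_SHIFT

    print(size_class_index(1))          # 0  -> the 8-byte class
    print(size_class_index(512))        # 63 -> the 512-byte class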

I was more intrigued by your first (speed) comparison:

> - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)

Now _that's_ interesting ;-)  Looks like spectral_norm recycles many
short-lived Python floats at a swift pace.  So memory management
should account for a large part of its runtime (the arithmetic it does
is cheap in comparison), and obmalloc and mimalloc should both excel
at recycling mountains of small objects.  Why is mimalloc
significantly faster?  This benchmark should stay in the "fastest
paths" of both allocators most often, and they both have very lean
fastest paths (they both use pool-local singly-linked size-segregated
free lists, so malloc and free for both should usually amount to just
popping or pushing one block off/on the head of the appropriate list).
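
"Very lean" meaning the fast paths have roughly this toy shape
(neither allocator's real code - both work per pool/page on raw
memory, but the list discipline is the same):

    class Block:
        __slots__ = ("next",)

    free_lists = [None] * 64            # one list head per size class

    def fast_malloc(cls):
        b = free_lists[cls]
        if b is None:
            return None                 # slow path: carve up a fresh pool
        free_lists[cls] = b.next        # pop the head block
        return b

    def fast_free(b, cls):
        b.next = free_lists[cls]        # push the block back on the head
        free_lists[cls] = b

    for _ in range(2):                  # seed, as the slow path would
        fast_free(Block(), 0)
    assert fast_malloc(0) is not None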

obmalloc's `address_in_range()` is definitely a major overhead in its
fastest `free()` path, but then mimalloc has to figure out which
thread is doing the freeing (looks cheaper than address_in_range, but
not free).  Perhaps the layers of indirection that have been wrapped
around obmalloc over the years are to blame?  Perhaps mimalloc's
larger (16x) pools and arenas let it stay in its fastest paths more
often?  I don't know why, but it would be interesting to find out :-)
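
For reference, the gist of address_in_range() (a Python-flavored
paraphrase of the C in 3.7's Objects/obmalloc.c, with toy values; the
real version's raw read of the pool header may touch memory obmalloc
doesn't own, which is what makes it both clever and delicate):

    POOL_SIZE = 4 * 1024                    # pools are 4 KiB in 3.7
    ARENA_SIZE = 256 * 1024                 # arenas are 256 KiB

    arenas = [0x100000, None, 0x200000]     # arena base addresses (toy)
    pool_headers = {0x100000: 0}            # pool addr -> stored arenaindex

    def read_arena_index_at(pool):
        # stands in for the raw read of pool->arenaindex in C
        return pool_headers.get(pool, 999)  # garbage if not our pool

    def address_in_range(p):
        pool = p & ~(POOL_SIZE - 1)         # mask p down to its pool
        idx = read_arena_index_at(pool)
        return (idx < len(arenas)
                and arenas[idx] is not None
                and arenas[idx] <= p < arenas[idx] + ARENA_SIZE)

    print(address_in_range(0x100010))       # True: inside our toy arena
    print(address_in_range(0x300000))       # False: garbage index fails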