[Inada Naoki <songofaca...@gmail.com>, trying mimalloc]
>>> Hmm, it is not good. mimalloc uses MADV_FREE so it may affect
>>> some benchmarks. I will look into it later.
>> ...
>> $ ./python -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
>> Slower (60):
>> - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x
>> slower (+158%)
>> - logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x
>> slower (+127%)

> I think I understand why mimalloc uses more than twice the memory of
> pymalloc + glibc malloc in the logging_format and logging_simple benchmarks.
>
> These two benchmarks do something like this:
>
>     buf = []  # in StringIO
>     for _ in range(10*1024):
>         buf.append("important: some important information to be logged")
>     s = "".join(buf)  # StringIO.getvalue()
>     s.splitlines()
>
> mimalloc uses a size-segregated allocator for blocks up to ~512KiB, and
> the size class is determined by the top three bits.
> On the other hand, list increases its capacity by a factor of 9/8. That
> means the next size class is used on each realloc.

Often, but not always (multiplication by 9/8 may not change the top 3
bits - e.g., 128 * 9/8 = 144).

> In the end, every size class has 1~3 used/cached memory blocks.

No doubt that's part of it, but it's hard to believe it's most of it. If
the loop count above really is 10240, then there's only about 80K worth
of pointers in the final `buf`. To account for a difference of over 10M,
it would need to have left behind well over 100 _full_ size copies from
earlier reallocs.

In fact the number of list elements across resizes goes like so:

0, 4, 8, 16, 25, 35, 46, ..., 7671, 8637, 9723, 10945

Adding all of those sums to 96,113, so it accounts for less than 1M of
8-byte pointers even if none were ever released. mimalloc will, of
course, add its own slop on top of that - but not a factor of ten's
worth. Unless maybe it's using a dozen such buffers at once?

But does it really matter? ;-) mimalloc "should have" done MADV_FREE on
the pages holding the older `buf` instances, so it's not like the app is
demanding to hold on to the RAM (albeit that it may well show up in the
app's RSS unless/until the OS takes the RAM away).
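For the curious, the resize sequence above is easy to reproduce. Below is a sketch (not CPython's actual C code) of the pre-3.9 list over-allocation formula from Objects/listobject.c, plus a deliberately simplified "top three bits" size class - mimalloc's real binning differs in detail, so treat `top3_class` as illustrative only:

```python
def list_capacities(n_appends):
    """Capacities a CPython (<= 3.8) list passes through while appending
    n_appends items: allocated = n + (n >> 3) + (3 if n < 9 else 6)."""
    caps, cap = [], 0
    for n in range(1, n_appends + 1):
        if n > cap:                 # full: over-allocate for the new size
            cap = n + (n >> 3) + (3 if n < 9 else 6)
            caps.append(cap)
    return caps

def top3_class(nbytes):
    """Simplified size class keyed on a size's top three bits
    (illustrative; not mimalloc's exact binning)."""
    b = nbytes.bit_length()
    if b <= 3:
        return (b, nbytes)          # tiny sizes are exact
    return (b, nbytes >> (b - 3))   # (magnitude, top three bits)

caps = list_capacities(10 * 1024)
print(caps[:6])          # [4, 8, 16, 25, 35, 46]
print(sum(caps) * 8)     # total pointer bytes across all resizes: under 1 MiB
print(top3_class(128) == top3_class(144))  # True: 9/8 growth kept the class
```

This confirms both halves of the argument: the pointer storage across every resize sums to well under 1M even if nothing is ever freed, and a 9/8 capacity bump does not always land in a new top-three-bits class.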
> This is almost the worst case for mimalloc. In a more complex
> application, there may be more chances to reuse memory blocks.
>
> In a complex or huge application, this overhead will become relatively
> small. Its speed is attractive.
>
> But for memory efficiency, pymalloc + jemalloc / tcmalloc may be better
> for common cases.

The mimalloc page says that, in their benchmarks:

"""
In our benchmarks (see below), mimalloc always outperforms all other
leading allocators (jemalloc, tcmalloc, Hoard, etc), and usually uses
less memory (up to 25% more in the worst case).
"""

obmalloc is certainly more "memory efficient" (for some meanings of that
phrase) for smaller objects: in 3.7 it splits objects of <= 512 bytes
into 64 size classes. mimalloc also has (close to) 64 "not gigantic"
size classes, but those cover a range of sizes over a thousand times
wider (up to about half a meg). Everything obmalloc handles fits in
mimalloc's first 20 size classes. So mimalloc routinely needs more
memory to satisfy a "small object" request than obmalloc does.

I was more intrigued by your first (speed) comparison:

> - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)

Now _that's_ interesting ;-) Looks like spectral_norm recycles many
short-lived Python floats at a swift pace. So memory management should
account for a large part of its runtime (the arithmetic it does is cheap
in comparison), and obmalloc and mimalloc should both excel at recycling
mountains of small objects. Why is mimalloc significantly faster?

This benchmark should stay in the "fastest paths" of both allocators
most often, and they both have very lean fastest paths (they both use
pool-local, singly linked, size-segregated free lists, so malloc and
free for both should usually amount to just popping or pushing one block
off/on the head of the appropriate list).
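To make the "pop or push one block" point concrete, here is a toy model of that shared fast path - the class and names are my own, and real allocators thread the free list through the free blocks themselves rather than using Python objects:

```python
class SizeClassPool:
    """Toy model of the fast path common to obmalloc and mimalloc:
    each size class keeps a singly linked list of free blocks."""

    def __init__(self):
        self.free_head = None

    def malloc(self):
        block = self.free_head
        if block is None:
            return {"next": None}       # slow path: carve a fresh block
        self.free_head = block["next"]  # fast path: pop the list head
        return block

    def free(self, block):
        block["next"] = self.free_head  # fast path: push a new list head
        self.free_head = block

pool = SizeClassPool()
a = pool.malloc()       # slow path: nothing to recycle yet
pool.free(a)            # a becomes the free-list head
b = pool.malloc()       # fast path: pops a right back off
print(a is b)           # True - the freed block is recycled immediately
```

With a workload like spectral_norm's (free a float, allocate a float, repeat), nearly every call takes the two-instruction pop/push path, which is why any per-call overhead outside it - like `address_in_range()` - shows up so clearly.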
obmalloc's `address_in_range()` is definitely a major overhead in its
fastest `free()` path, but then mimalloc has to figure out which thread
is doing the freeing (which looks cheaper than address_in_range, but not
free). Perhaps the layers of indirection that have been wrapped around
obmalloc over the years are to blame? Perhaps mimalloc's larger (16x)
pools and arenas let it stay in its fastest paths more often? I don't
know why, but it would be interesting to find out :-)
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/554D4PU6LBBIKWJCQI4VKU2BVZD4Z3PM/