On Tue, Jul 9, 2019 at 5:29 PM Inada Naoki <songofaca...@gmail.com> wrote:
>
> On Tue, Jul 9, 2019 at 9:46 AM Tim Peters <tim.pet...@gmail.com> wrote:
> >
> >> I was more intrigued by your first (speed) comparison:
> >
> > > - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
> >
> > Now _that's_ interesting ;-)  Looks like spectral_norm recycles many
> > short-lived Python floats at a swift pace.  So memory management
> > should account for a large part of its runtime (the arithmetic it does
> > is cheap in comparison), and obmalloc and mimalloc should both excel
> > at recycling mountains of small objects.  Why is mimalloc
> > significantly faster?
>
> Totally agree.  I'll investigate this next.
>

I compared the "perf" output of mimalloc and pymalloc, and I managed to
optimize pymalloc!

$ ./python bm_spectral_norm.py --compare-to ./python-master
python-master: ..................... 199 ms +- 1 ms
python: ..................... 182 ms +- 4 ms

Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 182 ms +- 4 ms: 1.10x faster (-9%)

mimalloc uses many small static (inline) functions.
On the other hand, pymalloc_alloc and pymalloc_free are large functions
that contain the slow/rare paths.

PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines pymalloc_free.
But the compiler doesn't know which parts of pymalloc_alloc and
pymalloc_free are hot, so gcc failed to choose the right code to inline.
The remaining parts of pymalloc_alloc and pymalloc_free are called as
separate functions, and many push/pop instructions are executed because
they contain complex logic.

So I tried using LIKELY/UNLIKELY macros to tell the compiler which part is hot.
But I still needed to declare pymalloc_alloc and pymalloc_free as
"static inline" [1].
The generated assembly is well optimized: the hot code is at the top of
PyObject_Malloc [2] and PyObject_Free [3].
But there is a lot of code duplication in PyObject_Malloc,
PyObject_Calloc, etc...
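
As a rough illustration of the hot/cold split described above (this is
not the code from the PR), here is a minimal sketch of a toy free-list
allocator.  It assumes gcc or clang for __builtin_expect and the
noinline attribute.  The names toy_alloc, toy_alloc_slow, toy_free,
BLOCK_SIZE, NBLOCKS and slow_calls are hypothetical and exist only for
this example.

```c
#include <stddef.h>
#include <stdlib.h>

/* Branch hints; on compilers without __builtin_expect they do nothing. */
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#  define NOINLINE    __attribute__((noinline))
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#  define NOINLINE
#endif

#define BLOCK_SIZE 32   /* hypothetical fixed block size */
#define NBLOCKS    64   /* hypothetical blocks per refill */

typedef struct block { struct block *next; } block;

static block *freelist = NULL;
static unsigned long slow_calls = 0;   /* counts trips through the rare path */

/* Rare path: get a fresh chunk and thread it onto the free list.
 * Kept out of line so that it does not bloat the callers. */
static NOINLINE void *toy_alloc_slow(void)
{
    char *chunk = malloc((size_t)BLOCK_SIZE * NBLOCKS);
    if (UNLIKELY(chunk == NULL)) {
        return NULL;
    }
    slow_calls++;
    /* Blocks 1..NBLOCKS-1 go onto the free list; block 0 is returned. */
    for (int i = NBLOCKS - 1; i >= 1; i--) {
        block *b = (block *)(chunk + (size_t)i * BLOCK_SIZE);
        b->next = freelist;
        freelist = b;
    }
    return chunk;
}

/* Hot path: a few instructions, small enough to inline everywhere. */
static inline void *toy_alloc(void)
{
    block *b = freelist;
    if (LIKELY(b != NULL)) {
        freelist = b->next;
        return b;
    }
    return toy_alloc_slow();
}

static inline void toy_free(void *p)
{
    block *b = p;
    b->next = freelist;
    freelist = b;
}
```

With this shape, only the pop from the free list is inlined into each
caller, and the refill logic stays in one out-of-line function.  Whether
a compiler actually places the hot path first still depends on the
compiler and its optimization flags.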

[1] https://github.com/python/cpython/pull/14674/files
[2] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmalloc-s-L232-L274
[3] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmalloc-s-L2-L32

I will try to split pymalloc_alloc and pymalloc_free into smaller functions.

Apart from the above, there is one more important difference.

pymalloc returns a pool to the free pool list as soon as the pool becomes empty.
On the other hand, mimalloc does not return a "page" (similar to a "pool" in
pymalloc) every time it becomes empty [4].  So it can avoid rebuilding the
linked list of free blocks.
I think pymalloc should do the same optimization.
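
To make the cost difference concrete, here is a toy model (it is neither
pymalloc's nor mimalloc's real data structure) that counts how often a
pool's free list has to be rebuilt when a single block is allocated and
freed in a loop.  All names (pool, pool_init, pool_alloc, pool_free,
churn, POOL_BLOCKS, rebuilds) are hypothetical, and the retire_eagerly
flag only stands in for the two policies described above.

```c
#include <stddef.h>

#define POOL_BLOCKS 8   /* hypothetical number of blocks per pool */

typedef struct pool {
    int next_free[POOL_BLOCKS]; /* index-based free list, -1 terminates */
    int free_head;              /* first free block, or -1 if full */
    int used;                   /* blocks currently allocated */
    int initialized;            /* 0 after the pool was handed back */
} pool;

static unsigned long rebuilds = 0;  /* how often the free list was built */

/* Build the free list from scratch; this is the work we want to avoid. */
static void pool_init(pool *p)
{
    for (int i = 0; i < POOL_BLOCKS; i++) {
        p->next_free[i] = (i + 1 < POOL_BLOCKS) ? i + 1 : -1;
    }
    p->free_head = 0;
    p->used = 0;
    p->initialized = 1;
    rebuilds++;
}

/* Returns a block index, or -1 if the pool is full. */
static int pool_alloc(pool *p)
{
    if (!p->initialized) {
        pool_init(p);
    }
    int i = p->free_head;
    if (i < 0) {
        return -1;
    }
    p->free_head = p->next_free[i];
    p->used++;
    return i;
}

static void pool_free(pool *p, int i, int retire_eagerly)
{
    p->next_free[i] = p->free_head;
    p->free_head = i;
    p->used--;
    if (p->used == 0 && retire_eagerly) {
        /* Eager policy: the empty pool is given back immediately, so
         * its free list is lost and must be rebuilt on the next use. */
        p->initialized = 0;
    }
    /* Lazy policy: the empty pool and its free list are kept. */
}

/* Allocate and free one block `rounds` times; return the rebuild count. */
static unsigned long churn(int retire_eagerly, int rounds)
{
    pool p = {0};
    rebuilds = 0;
    for (int r = 0; r < rounds; r++) {
        int i = pool_alloc(&p);
        pool_free(&p, i, retire_eagerly);
    }
    return rebuilds;
}
```

In this model churn(1, n) rebuilds the list on every round, while
churn(0, n) builds it only once.  A real allocator would also need a
rule for when an empty pool is finally given back, which this sketch
leaves out.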

[4] https://github.com/microsoft/mimalloc/blob/1125271c2756ee1db1303918816fea35e08b3405/src/page.c#L365-L375

BTW, which is the proper name: pymalloc or obmalloc?

Regards,
-- 
Inada Naoki  <songofaca...@gmail.com>
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/YWNWHGTJUMZ4D34DPRFXECF7O7GRJK2M/
