Re: Popcount optimization using AVX512

Andres Freund Fri, 09 Feb 2024 20:33:47 -0800

Hi,

On 2024-02-09 15:27:57 -0800, Noah Misch wrote:
> On Fri, Feb 09, 2024 at 10:24:32AM -0800, Andres Freund wrote:
> > On 2024-01-26 07:42:33 +0100, Alvaro Herrera wrote:
> > > This suggests that finding a way to make the ifunc stuff work (with good
> > > performance) is critical to this work.
> > 
> > Ifuncs are effectively implemented as a function call via a pointer, they're
> > not magic, unfortunately. The sole trick they provide is that you don't
> > manually have to use the function pointer.
> 
> The IFUNC creators introduced it so glibc could use arch-specific memcpy with
> the instruction sequence of a non-pointer, extern function call, not the
> instruction sequence of a function pointer call.


My understanding is that the ifunc mechanism just avoid the need for repeated
indirect calls/jumps to implement a single function call, not the use of
indirect function calls at all. Calls into shared libraries, like libc, are
indirected via the GOT / PLT, i.e. an indirect function call/jump.  Without
ifuncs, the target of the function call would then have to dispatch to the
resolved function. Ifuncs allow to avoid this repeated dispatch by moving the
dispatch to the dynamic linker stage, modifying the contents of the GOT/PLT to
point to the right function. Thus ifuncs are an optimization when calling a
function in a shared library that's then dispatched depending on the cpu
capabilities.

However, in our case, where the code is in the same binary, function calls
implemented in the main binary directly (possibly via a static library) don't
go through GOT/PLT. In such a case, use of ifuncs turns a normal direct
function call into one going through the GOT/PLT, i.e. makes it indirect. The
same is true for calls within a shared library if either explicit symbol
visibility is used, or -symbolic, -Wl,-Bsymbolic or such is used. Therefore
there's no efficiency gain of ifuncs over a call via function pointer.


This isn't because ifunc is implemented badly or something - the reason for
this is that dynamic relocations aren't typically implemented by patching all
callsites (".text relocations"), which is what you would need to avoid the
need for an indirect call to something that fundamentally cannot be a constant
address at link time. The reason text relocations are disfavored is that
they can make program startup quite slow, that they require allowing
modifications to executable pages which are disliked due to the security
implications, and that they make the code non-shareable, as the in-memory
executable code has to differ from the on-disk code.


I actually think ifuncs within the same binary are a tad *slower* than plain
function pointer calls, unless -fno-plt is used. Without -fno-plt, an ifunc is
called by 1) a direct call into the PLT, 2) loading the target address from
the GOT, 3) making an an indirect jump to that address.  Whereas a "plain
indirect function call" is just 1) load target address from variable 2) making
an indirect jump to that address. With -fno-plt the callsites themselves load
the address from the GOT.



Greetings,

Andres Freund

Re: Popcount optimization using AVX512

Reply via email to