On Wednesday, 16 May 2018 at 16:54:19 UTC, Walter Bright wrote:
I used to do things like that a simpler way. 3 functions would be created:

  void FeatureInHardware();
  void EmulateFeature();
  void Select();
  void function() doIt = &Select;

I.e. the first time doIt is called, it calls the Select function, which then resets doIt to either FeatureInHardware() or EmulateFeature().

It costs an indirect call, but if you move it up the call hierarchy a bit so it isn't in the hot loops, the indirect function call cost is negligible.

The advantage is there was only one binary.

It certainly sounds reasonable enough for 99% of use cases. But I'm definitely the 1% here ;-)

Indirect calls invoke the wrath of the branch predictor on XB1/PS4 (i.e. an AMD Jaguar processor). But there's certainly some more interesting non-processor behaviour, at least with MSVC compilers. The auto-DLL loading provided in that environment turns a call to your DLL-boundary-crossing function into a call through a jump table, which then performs a jump instruction to actually reach your DLL code. I suspect this is more costly than the indirect call, at a "write a basic test" level. Doing an indirect call as the only action in a for-loop is guaranteed to bring out the costly branch predictor on the Jaguar. Without getting in and profiling a bunch of stuff, I'm not entirely sure which of the two I'd prefer as a general approach.

Certainly, as far as this particular thread goes, every general-purpose function of a few lines that I write that uses intrinsics is forced inline. No function calls, indirect or otherwise. And on top of that, the inlined code usually pushes the branches out across the byte-boundary lines just far enough that only the simple branch predictor is ever invoked.

(Related: one feature I'd really, really love for linkers to implement is the ability to mark certain functions to only ever be linked at a certain byte boundary. And that's purely because Jaguar branch prediction often made my profiling tests non-deterministic between compiles. A NOP is a legit optimisation on those processors.)
