Adam D. Ruppe wrote:
On Tue, Apr 13, 2010 at 11:10:24AM -0400, Clemens wrote:
That's strange. Looking at src/backend/cod4.c, function cdbscan, in the dmd
sources, bsr seems to be implemented in terms of the bsr opcode [1] (which I
guess is the reason it's an intrinsic in the first place). I would have
expected this to be much, much faster than a user function. Anyone care enough
to check the generated assembly?
The opcode is fairly slow anyway (as far as opcodes go) - odds are the
implementation inside the processor is similar to Jerome's method, and
the main savings come from it loading fewer bytes into the pipeline.
I remember a line from a blog (IIRC by the author of the C++ FQA) saying
that hardware and software are pretty much the same thing -
moving an instruction to hardware doesn't mean it will be any faster,
since it is the same algorithm, just done in processor microcode instead of
user opcodes.
The bsr opcode is fast on Intel, slow on AMD. I bet the speed difference
comes from inlining max().
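
For anyone who wants to compare against the opcode, here is a rough sketch in C of the kind of software bit scan being discussed. It is not Jerome's code (that isn't quoted here) and not what dmd generates; the name bsr_soft is made up for illustration. It does a binary search for the highest set bit, halving the search range at each step:

```c
#include <stdint.h>

/* Hypothetical software stand-in for the bsr opcode: returns the index
   (0..31) of the highest set bit of v.
   Precondition: v != 0. The hardware opcode also leaves its destination
   undefined for a zero source; this sketch happens to return 0. */
static int bsr_soft(uint32_t v)
{
    int index = 0;

    /* Each test asks: is there a set bit in the upper half of what is
       left? If so, drop the lower half and add its width to the index. */
    if (v & 0xFFFF0000u) { v >>= 16; index += 16; }
    if (v & 0x0000FF00u) { v >>= 8;  index += 8;  }
    if (v & 0x000000F0u) { v >>= 4;  index += 4;  }
    if (v & 0x0000000Cu) { v >>= 2;  index += 2;  }
    if (v & 0x00000002u) {           index += 1;  }

    return index;
}
```

That is five compare-and-shift steps for 32 bits, so a caller that gets this inlined pays no call overhead, which is the same effect I'm guessing at for max(). Whether it beats the opcode on a given CPU is something you'd have to measure.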