On Tue, Apr 13, 2010 at 11:10:24AM -0400, Clemens wrote:
> That's strange. Looking at src/backend/cod4.c, function cdbscan, in the dmd
> sources, bsr seems to be implemented in terms of the bsr opcode [1] (which I
> guess is the reason it's an intrinsic in the first place). I would have
> expected this to be much, much faster than a user function. Anyone care
> enough to check the generated assembly?

The opcode is fairly slow anyway (as far as opcodes go). Odds are the implementation inside the processor is similar to Jerome's method, and the main savings come from loading fewer bytes into the pipeline.

I remember a line from a blog (IIRC it was the author of the C++ FQA) saying that hardware and software are pretty much the same thing: moving an instruction to hardware doesn't mean it will be any faster, since it is the same algorithm, just done in processor microcode instead of user opcodes.

-- 
Adam D. Ruppe
http://arsdnet.net
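
To make the "same algorithm, different place" point concrete, here is a minimal sketch in C of a software bit scan reverse. Jerome's method is not quoted in this message, so this is an assumption and not his code: a generic binary search over the bit positions, with a naive shift loop alongside it for comparison. The names soft_bsr and loop_bsr are hypothetical, and nothing here claims to match what any particular processor does internally.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software stand-in for the bsr opcode: returns the index
   (0..31) of the highest set bit of x, found by binary search.
   Precondition: x != 0. The bsr opcode leaves its result undefined for
   a zero operand, so the sketch rejects that case outright. */
static int soft_bsr(uint32_t x)
{
    assert(x != 0);
    int pos = 0;
    /* Each step checks whether the upper half of the remaining window
       holds a set bit, and if so narrows the search to that half. */
    if (x & 0xFFFF0000u) { x >>= 16; pos += 16; }
    if (x & 0x0000FF00u) { x >>= 8;  pos += 8;  }
    if (x & 0x000000F0u) { x >>= 4;  pos += 4;  }
    if (x & 0x0000000Cu) { x >>= 2;  pos += 2;  }
    if (x & 0x00000002u) {           pos += 1;  }
    return pos;
}

/* Naive reference version: shift right until nothing is left. */
static int loop_bsr(uint32_t x)
{
    assert(x != 0);
    int pos = 0;
    while (x >>= 1)
        pos++;
    return pos;
}
```

The binary search takes a fixed five steps for a 32-bit operand, while the loop takes as many iterations as the position of the highest set bit. Whether either beats the opcode on a given processor is something only a benchmark or the generated assembly can settle.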