On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote:
So it seems like a quarter-wave LUT is 27 times faster than sin…


If so then that is great and what I'd expected to achieve originally.

I guess this is using LDC, though? I wasn't able to compile with LDC since, after updating, I'm getting linker errors that I still have to figure out.

Yes, the gist linked above is just your code with minor changes; that version was 4-5 times faster. To get to 27 times faster you need to use the integer bit-manipulation scheme that I suggested above. Just beware that I haven't checked the code, so it might be off by ±1 and such.

Anyway, it is more fun for you to code up your own version than to try to figure out mine. Just follow the principles and you should get close to that performance, I think. (I'll refine the code later, but I don't really have time now.)


You just have to make sure that the generated instructions fill the entire CPU pipeline.

What exactly does this mean? I realize the pipeline in CPUs is how the CPU decodes and optimizes the instructions, but when you say "you have to make sure", that presupposes there is a method or algorithm for knowing.

Yes, you have to look up information about the CPU in your computer. Each core has a set of "lanes" (execution ports) that execute instructions simultaneously. Some instructions can go into many lanes, but not all. Then there might be bubbles in a lane that can be filled up with integer/bit-manipulation instructions. It is tedious to look that stuff up, so as a last resort, just try to mix simple integer with simple double computations (and avoid division).
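
For the "avoid division" part, a toy example of what I mean (the function name is just for illustration, it is not from the gist): hoist the divide out of the loop, so the loop body is only a cheap double multiply plus the integer index update.

void scaleSamples(double[] samples)
{
    // One divide, hoisted out of the loop; inside the loop only a
    // pipelined double multiply and the integer index update remain.
    immutable double inv = 1.0 / samples.length;
    foreach (i; 0 .. samples.length)
        samples[i] = i * inv;        // instead of i / samples.length
}

A double divide can occupy a lane for roughly 10-20 cycles and is not fully pipelined, whereas the multiply is.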


Are you saying that I did not have enough instructions that the pipeline could take advantage of?

Yes, you most likely got bubbles: empty slots where the core has nothing to send down a lane because it is waiting for some computation to finish so that it can figure out what to do next.

Basic optimization:

Step 1: reduce dependencies between computations (see the accumulator sketch after this list)

Step 2: make sure you generate a mix of simple integer/double instructions that can fill up all the computation lanes at the same time

Step 3: make sure loops contain only a few instructions; the CPU can unroll short loops in hardware. (Not applicable here, though.)
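
A toy illustration of step 1 (the names are just for illustration, nothing to do with the gist): a single running sum is one long dependency chain, whereas splitting it over several independent accumulators lets the core keep several additions in flight at once.

double sumOne(const double[] xs)
{
    double s = 0.0;
    foreach (x; xs)
        s += x;              // each add has to wait for the previous one
    return s;
}

double sumFour(const double[] xs)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= xs.length; i += 4)
    {
        s0 += xs[i];         // four independent chains, so the adds
        s1 += xs[i + 1];     // can overlap in the pipeline
        s2 += xs[i + 2];
        s3 += xs[i + 3];
    }
    for (; i < xs.length; ++i)
        s0 += xs[i];
    return (s0 + s1) + (s2 + s3);
}

The reordered sum can differ from the sequential one in the last bits, which is usually fine for audio.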


Of course, a lot of that might simply be due to LDC and I wasn't able to determine this.

I think I got better performance because I filled more lanes in the pipeline.


Half sine was done above, but a quarter sine can be used (there are 4 quadrants but only one has to be tabulated, because all the others differ by sign and reversal (1 - x); it's a matter of figuring out the sign).

Yes, as I mentioned, the first (most significant) bit of the phase is the sign and the second bit is the reversal of the indexing.
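
To make that concrete, something along these lines should work (untested sketch, one possible layout only, so the rounding at the quadrant edges may well be off by ±1 step): a 32-bit phase accumulator where bit 31 selects the sign, bit 30 selects reversed indexing, and the bits below that index the table.

import std.math : sin, PI;

enum tableBits = 10;                    // 1024-entry quarter-wave table
enum tableSize = 1 << tableBits;

__gshared float[tableSize] quarterTable;

shared static this()
{
    foreach (i; 0 .. tableSize)
        quarterTable[i] = cast(float) sin((i + 0.5) * (PI / 2) / tableSize);
}

// phase is a 32-bit accumulator: 0 .. 2^32 maps onto 0 .. 2*PI
float lutSin(uint phase)
{
    immutable uint sign   = phase >> 31;          // bit 31: 3rd/4th quadrant -> negate
    immutable uint mirror = (phase >> 30) & 1;    // bit 30: 2nd/4th quadrant -> reverse
    uint index = (phase >> (30 - tableBits)) & (tableSize - 1);
    if (mirror)
        index = (tableSize - 1) - index;
    immutable float y = quarterTable[index];
    return sign ? -y : y;
}

In an oscillator loop the phase update is then just "phase += step;", with the wraparound at 2*PI handled for free by the 32-bit overflow.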


Of course it requires extra computation, so it would be interesting to see the difference in performance for the extra logic.

It adds perhaps 2-5 cycles or so, at a guess.


exp(x) can be written as exp(floor(x) + {x}) = exp(floor(x))*exp({x})
[...]
With linear interpolation one can get very accurate (for all practical purposes) LUT methods that, if your code is right, are at least an order of magnitude faster. The memory requirements will be quite small with linear interpolation.

I think you need to do something with x before you look it up, so that you have some kind of fast nonlinear mapping to the indexes.

But for exp() you might prefer an approximation instead, perhaps a polynomial Taylor series.

Searching the web should give some ideas.
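
To sketch the exp() idea anyway (untested toy code, the names are just for illustration): split x into floor(x), served by a small table, and the fractional part in [0, 1), served by a short Taylor polynomial. Swap the polynomial for an interpolated table of exp over [0, 1) if you need more accuracy, as you suggest.

import std.math : exp, floor;

// exp(n) for integer n in -32 .. 31, filled once at startup.
__gshared double[64] expInt;

shared static this()
{
    foreach (i; 0 .. 64)
        expInt[i] = exp(cast(double)(i - 32));
}

// Rough exp for x in [-32, 31); no range check, so real code would clamp.
// The 4-term Taylor series for the fractional part has roughly 2% error
// near the top of [0, 1).
double fastExp(double x)
{
    immutable double fl = floor(x);
    immutable double f  = x - fl;                  // fractional part {x} in [0, 1)
    immutable size_t n  = cast(size_t)(cast(int) fl + 32);
    // exp(f) ~ 1 + f + f^2/2 + f^3/6, evaluated in Horner form
    immutable double ef = 1.0 + f * (1.0 + f * (0.5 + f * (1.0 / 6.0)));
    return expInt[n] * ef;
}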


It seems you already have the half-sin done.

I did the quarter-sin though, not the half-sin (but that is almost the same; just drop the reversal of the indexing).

(Let's talk about this later, since we both have other things on our plate. Fun topic! :-)
