On Thursday, 9 February 2017 at 19:39:49 UTC, Nestor wrote:
OK I changed the approach using a multidimensional array for the matrix so I could ditch arithmetic operations altogether, but curiously after measuring a few thousand runs of both implementations through avgtime, I see no noticeable difference. Why?

Truthfully, because you'll need tens of millions or hundreds of millions in length to determine if it makes a difference and how much. Addition, subtraction and simple memory lookups take very little time, and since the entire array (100 bytes) fits in the cache, it is going to perform very very very well regardless if you can optimize it further.

If you tested this on a much slower system, say an 8bit 6502 the differences would be far more pronounced, but not that much different.

Since the algorithm is more or less O(n) optimizing it won't make many differences.

It's possible you could get a speedup by making them ints instead of chars, since then it might not have a penalty for the 'address not divisible by 4' that applies (which is more for ARM architectures and less for x86).

Other optimizations could be to make it multiple levels, taking the basic 100 elements and expanding them 2-3 levels deep in a lookup and having it do it in more or less a single operation. (100 bytes for 1 level, 10,000 for 2 levels, 1,000,000 for 3 levels, 100,000,000 for 4 levels, etc), but the steps of converting it to the array lookup won't give you that much gain, although fewer memory lookups but none of them will be cached, so any advantage from that is probably lost. Although if you bump up the size to 16x10 instead of 10x10, you could use a shift instead of *10 which will make that slightly faster (there will be unused empty padded spots)

In theory if you avoid the memory lookup at all, you could gain some amount of speed, depending on how it searches a manual table, although using a switch-case and a mixin to do all the details it feels like it wouldn't give you any gain...

Division operations are the slowest operations you can do, but otherwise most instructions run really fast. Unless you're trying to get it smaller (fewer bytes for the call) or shaving for speed by instruction cycle counting (like on the 6502) i doubt you'll get much benefit.

Reply via email to