ni...@lysator.liu.se (Niels Möller) writes:

Hmm. Definitely worth a try. But if we need explicit loads and stores from the structs, we'll not save that many instructions.

x86 can handle operate-from-memory at almost the same cost as operate-from-register. But operate-to-memory is expensive. Load/store architectures typically reach their peak execution throughput only with a mixture of loads and operates.

Each iteration needs to load all the values but store only half of them, so each pair of values costs load + load + store, compared to mov, xor, and, xor, xor for conditionally swapping using a mask.

You might want to explicitly load some of the values into registers. (And perhaps use the "restrict" keyword for the swappable pointers... but I am afraid that we'd be stepping outside the vague semantics of restrict.)

> Some measurements with method 4 and 5 are now in. Modern Intel CPUs
> like method 5, as I had expected.

Nice!

With a few % margin over method 3. 8 configs now vouch for method 5. And method 4 got its first "honourable mention"; beagle thinks it is 2nd best. :-)

I don't think method 4 will see much use unless we find a way to radically improve the applicability of its large tables.

I made table size 2048 the default, just for testing purposes. The next smaller table size is 512 bytes, which is a more reasonable size. It gets an 87% hit rate. I'd say we need that to get beyond 95% for method 4 to become viable.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
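
For concreteness, the two ways of doing the conditional swap compared above might look roughly like this in C. This is only a sketch under assumptions: `limb_t`, `cswap_mask` and `cswap_ptr` are names made up for illustration, not GMP functions, and the instruction sequences a compiler actually emits will vary.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for one machine word */

/* Mask variant: both values live in registers and are exchanged in place.
   With the values already loaded this is roughly mov, xor, and, xor, xor. */
void cswap_mask(limb_t *a, limb_t *b, int swap)
{
    limb_t mask = -(limb_t)(swap != 0);   /* all ones if swapping, else zero */
    limb_t t = (*a ^ *b) & mask;          /* difference, or 0 when not swapping */
    *a ^= t;
    *b ^= t;
}

/* Pointer variant: the data stays in memory and only the two pointers are
   exchanged.  Later code must then load through the pointers and store the
   updated value back, which is the load + load + store cost per pair. */
void cswap_ptr(limb_t **pa, limb_t **pb, int swap)
{
    limb_t *x = swap ? *pb : *pa;
    limb_t *y = swap ? *pa : *pb;
    *pa = x;
    *pb = y;
}

/* Small self-check showing that both variants agree on what "swapped" means. */
void cswap_selfcheck(void)
{
    limb_t u = 10, v = 3;
    limb_t *pu = &u, *pv = &v;

    cswap_mask(&u, &v, 1);                /* values move: u = 3, v = 10 */
    assert(u == 3 && v == 10);

    cswap_ptr(&pu, &pv, 1);               /* values stay, pointers move */
    assert(*pu == 10 && *pv == 3);
}
```

The mask variant keeps everything in registers at the price of a few extra ALU operations; the pointer variant does less arithmetic but forces the values through memory on every iteration.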