Hi,

This is a sort of follow-up to last month's discussion about two-word multiplication.
I have been trying another implementation, which has no check for overflow. In most cases it seems to be significantly faster.

My code is available here:
http://94.23.21.190/publicshare/mul64x2.tar.gz

A Makefile is provided, just run make. Then execute with these commands:

  ./mul64x2 bench-umul 10000000
  ./mul64x2 bench-smul 10000000

There are several test implementations in my code. There are two points I wanted to investigate:
- the impact of my no-overflow unsigned multiplication version
- the impact of my sign extension version: it uses a signed shift instead of an unsigned shift, which saves two subtractions

The different multiplication versions are:

umul:
- naive .... (forget that)
- nov ...... my no-overflow
- nov2 ..... my no-overflow, but written with all one-word multiplications first, as done in GMP
- gmp ...... GMP implementation, for reference

smul:
- naive .... (forget that)
- nov ...... my no-overflow
- nov2 ..... my no-overflow, but written with all one-word multiplications first, as done in GMP
- novgmp ... (not interesting) nov version, but the sign extension uses an unsigned shift, as done in GMP
- nov2gmp .. (not interesting) nov2 version, but the sign extension uses if(a<b) c++
- gmp ...... GMP implementation, for reference
- gmp2 ..... GMP implementation, but the sign extension uses my signed shift
- gmp3 ..... GMP implementation, but the sign extension uses if(a<b) c++

I tested on three different machines:
- Core2 Duo P7350, laptop
- i7-4790
- Cortex-A9, armv7l, board Odroid-X2

Here are the execution times of the different multiplication versions, relative to the gmp version (negative values mean faster than GMP):

Processor Core2 Duo P7350, laptop
  umul:
    nov ... +0.8%
    nov2 .. +0.72%
  smul:
    nov ... -5%
    nov2 .. -11%
    gmp2 .. -0.46%
    gmp3 .. +14.9%

Processor i7-4790
  umul:
    nov ... -7.6%
    nov2 .. -7.6%
  smul:
    nov ... -7.2%
    nov2 .. -7.2%
    gmp2 .. +6.7%
    gmp3 .. -6.2%

Processor Cortex-A9, armv7l, board Odroid-X2
  umul:
    nov ... -1.75%
    nov2 .. -12.3%
  smul:
    nov ... -1.93%
    nov2 .. -0.74%
    gmp2 .. -3.02%
    gmp3 .. -4.96%

Overall, my nov and nov2 versions are faster. Only umul on my Core2 Duo is slower, and not by much.

But... I don't understand some of the results.

I don't understand why gmp2 is slower on the i7; it should be faster, as it is on my laptop.

I don't understand why the performance of nov and nov2 differs on the Cortex-A9. Basically, the only difference is that some lines of code are swapped. Why would the compiler's scheduler output code with such different performance? I also don't understand why nov2 behaves so differently for smul and umul.

Finally, when compiled, my nov and nov2 versions use a few more processor instructions than the gmp version, so I don't know what the performance would be on other machines. That would depend on whether there is a conditional move/add instruction, etc.

So, for now I'd just like some comments on my nov and nov2 versions.

And happy new year!

Adrien
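
For readers without the tarball, the two ideas discussed above can be sketched in C as follows. This is only an illustrative sketch, not the code from mul64x2.tar.gz: the function names (umul_sketch, smul_signed_shift, smul_unsigned_shift), the 64-bit word size and the split into 32-bit halves are all assumptions. The unsigned routine relies on the bound (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so no carry test is needed; the two signed routines differ only in how the sign mask is built.

```c
#include <stdint.h>

/* Hypothetical sketch: 64x64 -> 128 unsigned multiply with no carry checks. */
static void umul_sketch(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t al = (uint32_t)a, ah = a >> 32;
    uint64_t bl = (uint32_t)b, bh = b >> 32;

    /* All one-word multiplications first. */
    uint64_t ll = al * bl;
    uint64_t lh = al * bh;
    uint64_t hl = ah * bl;
    uint64_t hh = ah * bh;

    /* A 32x32 product is at most (2^32-1)^2, so adding two 32-bit
       values to it cannot wrap around: no overflow test is needed. */
    uint64_t mid = hl + (ll >> 32) + (uint32_t)lh;

    *lo = (mid << 32) | (uint32_t)ll;
    *hi = hh + (mid >> 32) + (lh >> 32);
}

/* Sign correction with a signed shift: the mask is 0 or all ones directly.
   Assumes >> on a negative int64_t is an arithmetic shift, which is
   implementation-defined in C but true on mainstream compilers. */
static void smul_signed_shift(int64_t a, int64_t b, int64_t *hi, uint64_t *lo)
{
    uint64_t uh;
    umul_sketch((uint64_t)a, (uint64_t)b, &uh, lo);
    uh -= (uint64_t)(a >> 63) & (uint64_t)b;
    uh -= (uint64_t)(b >> 63) & (uint64_t)a;
    *hi = (int64_t)uh;
}

/* Same correction with an unsigned shift: the 0/1 result must be
   negated to form the mask, which costs two extra subtractions. */
static void smul_unsigned_shift(int64_t a, int64_t b, int64_t *hi, uint64_t *lo)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b, uh;
    umul_sketch(ua, ub, &uh, lo);
    uh -= (0 - (ua >> 63)) & ub;
    uh -= (0 - (ub >> 63)) & ua;
    *hi = (int64_t)uh;
}
```

Both signed variants return the same result; whether the signed-shift form is actually faster depends on the compiler and target, as the timings above suggest.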