I have made some modifications to the division code so that it will run faster on penryn.
It got a bit crazy in that I was essentially using the same code as GMP for the basecase but was still 10% slower no matter what I did. I noticed they use different compiler flags for core2/penryn (k8 flags in fact), but that didn't seem to be the problem.

I also noticed that for 1 to 15 limbs their addmul_1/submul_1 is faster by as much as 10-20%. Jason's code was sometimes a little faster for larger sizes, but the division code is critically dependent on submul_1 for small sizes. I have therefore switched to using the GMP submul_1.asm on core2 and penryn (it might be faster on other platforms too, I didn't check).

Brian, the code is in mpn/x86_64/core2/addmul_1.asm and mpn/x86_64/core2/submul_1.asm if you are interested in it for Windows. The files are identical except for add <-> sub.

Anyway, the 8192 x 4096 division in mpir_bench now runs at the same speed as GMP.

Unfortunately the changes I made to the division basecase code slow it down slightly on k10, but not enough to be a problem. We still win there.

Even after all this work, the speed program shows our basecase divapprox to be 10% slower than GMP's on penryn. But this is using very close to the same code in the relevant range, and I have spent more than a day trying to figure it out. It's not:

* memory allocation (there is none)
* a problem with the speed program itself
* compiler flags
* some assembly function that is slower in MPIR
* a major inefficiency in the code

Anyway, at least the benchmark is happy now.

Bill.
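For anyone not familiar with submul_1, here is a minimal reference sketch in C of what it computes. This is not the MPIR or GMP code, just the semantics as I understand them: subtract {up, n} * v from {rp, n} in place and return the high limb plus borrow. The names limb_t and ref_submul_1 are made up for illustration, and I use 32-bit limbs so that plain uint64_t arithmetic suffices (the real mp_limb_t is 64 bits on x86_64).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit limb type, chosen for portability only.
   MPIR/GMP use mp_limb_t, which is 64 bits on x86_64. */
typedef uint32_t limb_t;

/* Reference semantics of submul_1: {rp, n} -= {up, n} * v.
   Returns the limb that still has to be subtracted from the next
   higher limb of the destination (high product limb plus borrow). */
limb_t ref_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    size_t i;

    assert(n >= 1);
    for (i = 0; i < n; i++) {
        /* 32x32 -> 64-bit product plus incoming carry. This cannot
           overflow: (2^32-1)^2 + (2^32-1) < 2^64. */
        uint64_t t = (uint64_t)up[i] * v + carry;
        limb_t lo = (limb_t)t;
        limb_t hi = (limb_t)(t >> 32);
        limb_t r = rp[i];

        rp[i] = r - lo;          /* wraps modulo 2^32 on borrow */
        carry = hi + (r < lo);   /* propagate the borrow upward */
    }
    return carry;
}
```

In schoolbook division each quotient limb costs one such call across the whole divisor, so for small operands the run time is mostly this inner loop. That would be consistent with a 10-20% gap in submul_1 at 1 to 15 limbs being visible in the basecase timings.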