I have made some modifications to the division code so that it will run
faster on penryn.

It got a bit crazy in that I was essentially using the same code as GMP for
the basecase but was still 10% slower no matter what I did.

I noticed that GMP uses different compiler flags for core2/penryn (the k8
flags, in fact), but that didn't seem to be the problem.

I also noticed that for sizes of 1 to 15 limbs their addmul_1/submul_1 is
faster by as much as 10-20%. Jason's code was sometimes a little faster for
larger sizes, but the division code is critically dependent on submul_1 for
small sizes.

I have therefore switched to using the GMP submul_1.asm on core2 and penryn
(it might be faster on other platforms too, I didn't check).

Brian, the code is in mpn/x86_64/core2/addmul_1.asm and
mpn/x86_64/core2/submul_1.asm if you are interested in it for Windows. The
files are identical except for add <-> sub.
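
For anyone who hasn't looked at these routines, here is a rough C sketch of
what they compute. It is not the assembly and not the GMP or MPIR source: the
names ref_addmul_1 and ref_submul_1, the limb_t/dlimb_t types and the 64-bit
limb size are assumptions made for illustration, and it relies on the
unsigned __int128 extension in GCC/Clang. It only shows why the two files can
differ by a single add <-> sub.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* assumed 64-bit limb, as on x86_64 */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension: double limb */

/* {rp,n} += {up,n} * v; returns the carry-out limb. */
limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* up[i]*v + carry always fits in 128 bits. */
        dlimb_t p = (dlimb_t)up[i] * v + carry;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 64);
        limb_t r = rp[i] + lo;      /* the "add" of add <-> sub */
        carry = hi + (r < lo);      /* wrapped sum means a carry out */
        rp[i] = r;
    }
    return carry;
}

/* {rp,n} -= {up,n} * v; returns the borrow-out limb. */
limb_t ref_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        dlimb_t p = (dlimb_t)up[i] * v + carry;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 64);
        limb_t r = rp[i];
        rp[i] = r - lo;             /* the "sub" of add <-> sub */
        carry = hi + (r < lo);      /* r < lo means a borrow out */
    }
    return carry;
}
```

In a schoolbook division each quotient limb costs one submul_1 call over the
divisor, which is why small-size submul_1 speed matters so much to the
basecase.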

Anyway, the 8192 x 4096 division in mpir_bench now runs at the same speed
as it does with GMP.

Unfortunately, the changes I made to the division basecase code slow it down
slightly on k10, but not enough to be a problem. We still win there.

Even after all this work, the speed program shows our basecase divapprox to
be 10% slower than GMP's on penryn. But this is using almost the same code
in the relevant range, and I have spent more than a day trying to figure it
out. It's not:

* memory allocation (there is none)
* a problem with the speed program itself
* compiler flags
* some assembly function that is slower in MPIR
* a major inefficiency in the code

Anyway, at least the benchmark is happy now.

Bill.
