I was surprised to still see a bit of this in practice on an 8x Core2 system 
with the example from our paper:

f := (1 + x + y + 2*z^2 + 3*t^3 + 5*u^5)^12:
g := (1 + u + t + 2*z^2 + 3*y^3 + 5*x^5)^12:

What happens here is that we construct the result one term at a time, and 
doing that requires a lot of bandwidth.  We generally assume the cores share 
a cache, i.e. that they can transfer data to each other at least as fast as 
all of them (in aggregate) can transfer data to memory.  This is true of the 
Core i3/i5/i7 and AMD's CPUs.  On the 8 x Core2 you have two cores sharing 
an L2, then two of those fused together in a multi-chip module, and two MCMs 
connected on the motherboard.  So there are many different levels of 
interconnect, and the last one is especially slow.  In that case, an 
algorithm that works on a shared data structure in memory will scale better, 
and this is also true of large multi-CPU systems like the ones TRIP was 
tested on.
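To make "one term at a time" concrete, here is a minimal single-threaded 
Python sketch of heap-based (Johnson-style) sparse multiplication -- the 
representation, names, and details are my own for illustration, not Maple's 
internals.  Each output term is produced by popping the heap and touching 
both inputs, which is why the parallel version is so sensitive to how fast 
cores can move data between each other:

```python
import heapq

def sparse_mul(f, g):
    """Multiply sparse univariate polynomials f and g, each a list of
    (exponent, coefficient) pairs sorted by decreasing exponent.
    The product is generated one term at a time by merging |f| streams
    f[i]*g[0], f[i]*g[1], ... with a heap keyed on the exponent sum."""
    if not f or not g:
        return []
    # One stream per term of f; negate exponents to get a max-heap.
    heap = [(-(fe + g[0][0]), i, 0) for i, (fe, fc) in enumerate(f)]
    heapq.heapify(heap)
    result = []
    while heap:
        key, i, j = heapq.heappop(heap)
        e = -key
        c = f[i][1] * g[j][1]
        if result and result[-1][0] == e:
            # Same exponent as the previous output term: combine.
            result[-1] = (e, result[-1][1] + c)
            if result[-1][1] == 0:
                result.pop()
        else:
            result.append((e, c))
        # Advance stream i to the next term of g.
        if j + 1 < len(g):
            heapq.heappush(heap, (-(f[i][0] + g[j + 1][0]), i, j + 1))
    return result
```

For example, multiplying (1+x)^2 by (1+x) this way yields the terms of 
(1+x)^3 in decreasing order of exponent, one heap pop per product term.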

So that issue still exists in Maple 15; however, it is generally much better 
because memory is recycled.  The larger example on the TRIP page (power=16) 
does not slow down on our 8 x Core2 machine, and performance on Nehalem 
machines (2 x 4 cores) is very good.  Performance on the 8-core Nehalem EX 
makes me want to buy one.  Anyway, you can see we have really bet on 
manycore rather than SMP, mainly because I can't see how to do more 
complicated algorithms (like division or powering) on SMP without using an 
order of magnitude more memory.  It's a one-size-fits-all approach: the same 
design covers all of our algorithms and most CPUs.

As for Giac, thanks for the update.  I look forward to timing it for our 
next paper :)
