Quoting Brian Beesley <[EMAIL PROTECTED]>:

> There's another effect here which might get overlooked - and which
> Either way there's a bottleneck which is at the very least potential -
> between L1 and L2 in Intel systems, and between L2 and RAM in AMD
> systems. I believe this bottleneck may dominate threaded LL test
> performance, just as it does in single CPU systems when the memory
> bandwidth is less than the CPU may demand.

Yes, this will probably be critical in a multithreaded LL squaring. The only way around it is to arrange the code so that only a few cores are hitting main memory heavily, while the other cores, handling the other chunks of the FFT, already have their datasets in cache and crunch away on them.

If memory serves, prime95 already almost completely overlaps main memory latency with useful work. So one possibly more straightforward way to multithread the FFT is to add a third, horribly complicated pass where all the fine-granularity multiprocessing and nonlocal memory access takes place, then recurse to a largely intact version of the current code once things fit within a given size working set. It will probably only be a win if the third pass does a significant fraction (~30% :) of the total work.

jasonp
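
For what it's worth, the shape of the extra-pass idea can be sketched in a few lines. This is only a toy illustration under loud assumptions, not prime95's code: it is a plain complex FFT rather than the weighted real-data transform an LL test uses, and the names `top_pass`, `fft_local`, `chunked_fft`, `R` and `M` are all made up for the example. The extra pass does short length-`R` transforms across the chunks (strided, nonlocal access) and applies twiddles; after that each length-`M` chunk is contiguous and can be handed to an unmodified single-threaded FFT.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def fft_local(a):
    """Plain recursive radix-2 FFT; a stand-in for the existing
    single-threaded code. len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft_local(a[0::2])
    odd = fft_local(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def top_pass(x, R, M):
    """Hypothetical extra pass: length-R DFTs across chunks (stride M),
    followed by twiddle factors. All nonlocal memory access happens here.
    Returns R contiguous chunks of length M."""
    N = R * M
    chunks = [[0j] * M for _ in range(R)]
    for m in range(M):
        column = [x[M * r + m] for r in range(R)]  # strided gather
        for s in range(R):
            acc = sum(column[r] * cmath.exp(-2j * cmath.pi * r * s / R)
                      for r in range(R))
            chunks[s][m] = acc * cmath.exp(-2j * cmath.pi * m * s / N)
    return chunks


def chunked_fft(x, R, workers=4):
    """Top pass, then one independent local FFT per chunk."""
    N = len(x)
    M = N // R  # assumes R divides N and M is a power of two
    chunks = top_pass(x, R, M)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        done = list(pool.map(fft_local, chunks))
    # done[s][q] holds X[s + R*q]; undo the transposed output order.
    out = [0j] * N
    for s in range(R):
        for q in range(M):
            out[s + R * q] = done[s][q]
    return out
```

Python threads will not actually run this CPU-bound work in parallel, so the sketch shows only the data flow: which pass touches memory across chunks, and which part works on a chunk that could sit in one core's cache. Choosing `R` larger moves more of the total work into the extra pass, which is the trade-off mentioned above.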
