At 09:54 AM 9/14/2006, George Woltman wrote: >1) Give each thread a contiguous set of data blocks to crunch through. >Thread 1 gets blocks 0 - 127, thread 2 gets 128 - 255, etc. > >2) Have the 8 threads start on data blocks 0-7. Put a lock around the >carry code such that it insures that data blocks are processed in order. > >I lean toward the second solution. Does anyone have any real-world >experiences or alternative suggestions?
Upon further reflection, I think the first solution is the only attractive one. Both methods should have the same theoretical throughput, but when there isn't one CPU for each thread, then the first solution is better. Here's why: Say there are 7 available CPUs for 8 threads. Threads 0 to 6 start processing blocks 0-6 and prefetching blocks 8-14. Now all 7 threads stall and the thread processing block 7 is run destroying all that nice prefetching we did. My fear with the 2nd proposed solution that I was originally considering is that the serialization done at the carry propagation step will give the OS more opportunities to preempt one of our threads destroying the benefits of prefetching. I implemented multi-threading in pass 2 and it is working great. The true data independence made it a pretty easy chore. I've not started multi-threading in pass 1 as I'm still looking for better solutions. It would be nice if the algorithm worked well no matter how many CPUs are available. So far no luck. If there are 8 CPUs and 8 threads, then as soon as one CPU is busy with other work throughput is cut in half. Yuk. _______________________________________________ Prime mailing list [email protected] http://hogranch.com/mailman/listinfo/prime
