> At 09:54 AM 9/14/2006, George Woltman wrote:
> > > 1) Give each thread a contiguous set of data blocks to crunch through. Thread 1 gets blocks 0 - 127, thread 2 gets 128 - 255, etc.
> > > 2) Have the 8 threads start on data blocks 0-7. Put a lock around the carry code such that it ensures that data blocks are processed in order.
> > I lean toward the second solution. Does anyone have any real-world experience or alternative suggestions?
>
> Upon further reflection, I think the first solution is the only attractive one.
>
> Both methods should have the same theoretical throughput, but when there isn't one CPU for each thread, the first solution is better. Here's why:
>
> Say there are 7 available CPUs for 8 threads. Threads 0 to 6 start processing blocks 0-6 and prefetching blocks 8-14. Now all 7 threads stall, and the thread processing block 7 is run, destroying all that nice prefetching we did.
>
> My fear with the 2nd proposed solution that I was originally considering is that the serialization done at the carry propagation step will give the OS more opportunities to preempt one of our threads, destroying the benefits of prefetching.
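The second solution George describes, a lock around the carry code that forces blocks to commit in order, can be sketched with a condition variable and a "next block to carry" counter. This is only a toy illustration, not Prime95's actual code: the block data, base, and names (`NBLOCKS`, `carry_in`, `process`) are all assumptions, and the per-block crunch phase stands in for the real FFT work.

```python
import threading

NBLOCKS = 8
BASE = 10000
# Toy unnormalized data: each block's last digit overflows into the next block.
blocks = [[0, 0, 0, 15000] for _ in range(NBLOCKS)]
carry_in = [0] * (NBLOCKS + 1)   # carry_in[i] feeds block i; [NBLOCKS] is the final carry

cond = threading.Condition()
next_to_carry = 0                # index of the next block allowed to run its carry step

def process(i):
    global next_to_carry
    # Crunch phase (stands in for the FFT work): order-independent,
    # runs fully in parallel with no locking.
    with cond:
        # Carry phase: serialized so block i sees the carry out of block i-1.
        while next_to_carry != i:
            cond.wait()
        c = carry_in[i]
        for j, d in enumerate(blocks[i]):
            total = d + c
            blocks[i][j] = total % BASE
            c = total // BASE
        carry_in[i + 1] = c
        next_to_carry += 1
        cond.notify_all()

threads = [threading.Thread(target=process, args=(i,)) for i in range(NBLOCKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The waiting inside `cond.wait()` is exactly the serialization point George worries about: a thread that reaches its carry step early sleeps until its predecessor commits, giving the OS an opening to deschedule it.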
Perhaps I do not understand enough about this, but if I follow correctly, one thread can start on blocks 0-127 while at the same time another starts on blocks 128-255... This would mean that the earlier suggestion by lloyd could be a nice compromise: allocate smaller batches of blocks, for instance 8 blocks per thread (thread 1 gets blocks 0-7, thread 2 gets blocks 8-15, ...), and whenever a thread finishes its batch, it takes the first free one. Would the number of carries to be stored and processed afterwards be the drawback of this solution?
_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime
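The batching compromise described above, where each thread grabs the next free batch, normalizes it assuming a zero incoming carry, and records the boundary carry for a cheap serial fix-up pass afterwards, might look like this toy sketch. All names, the base, and the batch size are illustrative assumptions, not actual Prime95 code.

```python
import threading

BASE, CHUNK, NCHUNKS = 10000, 8, 16
digits = [BASE + 5] * (CHUNK * NCHUNKS)   # toy unnormalized digits
chunk_carry = [0] * NCHUNKS               # carry-out recorded per batch

next_chunk = 0
grab = threading.Lock()

def worker():
    global next_chunk
    while True:
        with grab:                        # "take the first free batch"
            if next_chunk >= NCHUNKS:
                return
            k = next_chunk
            next_chunk += 1
        c = 0                             # incoming carry assumed 0 for now
        for i in range(k * CHUNK, (k + 1) * CHUNK):
            t = digits[i] + c
            digits[i] = t % BASE
            c = t // BASE
        chunk_carry[k] = c                # defer the inter-batch carry

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Serial fix-up: feed each recorded carry into the start of the next batch.
# (In general a ripple could run past the batch boundary; this toy data won't.)
for k in range(NCHUNKS - 1):
    c = chunk_carry[k]
    i = (k + 1) * CHUNK
    while c and i < len(digits):
        t = digits[i] + c
        digits[i] = t % BASE
        c = t // BASE
        i += 1
final_carry = chunk_carry[-1]
```

The stored `chunk_carry` array is exactly the cost the question asks about: one deferred carry per batch, so smaller batches mean more carries to store and a longer serial fix-up pass at the end.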
