> At 09:54 AM 9/14/2006, George Woltman wrote:
>
> > 1) Give each thread a contiguous set of data blocks to crunch through.
> > Thread 1 gets blocks 0 - 127, thread 2 gets 128 - 255, etc.
>
> > 2) Have the 8 threads start on data blocks 0-7.  Put a lock around the
> > carry code such that it ensures that data blocks are processed in order.
>
> I lean toward the second solution.  Does anyone have any real-world
> experiences or alternative suggestions?
>
> Upon further reflection, I think the first solution is the only attractive
> one.
>
> Both methods should have the same theoretical throughput, but when there
> isn't one CPU for each thread, the first solution is better.  Here's
> why:
>
> Say there are 7 available CPUs for 8 threads.  Threads 0 to 6 start
> processing blocks 0-6 and prefetching blocks 8-14.  Now all 7 threads
> stall, and the thread processing block 7 is run, destroying all that nice
> prefetching we did.
>
> My fear with the 2nd solution, which I was originally considering, is that
> the serialization done at the carry propagation step will give the OS more
> opportunities to preempt one of our threads, destroying the benefits of
> prefetching.


Perhaps I do not understand enough about this, but if I follow correctly, one
thread can start on blocks 0-127 while, at the same time, another one starts
on blocks 128-255.  This would mean that the earlier suggestion by lloyd
could be a nice compromise: allocate smaller bunches of blocks.  For
instance, give 8 blocks to each thread (thread 1 gets blocks 0-7, thread 2
gets 8-15, ...), and whenever a thread finishes its bunch, it takes the
first free one.  Would the number of carries to be stored and processed
afterwards be the drawback of this solution?
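A minimal sketch of this compromise, under the assumption that each block's carry can be stored and propagated in a later sequential pass (the names `process_block` and the carry model are hypothetical stand-ins, not gwnum code):

```python
import threading

NUM_BLOCKS = 64
CHUNK = 8            # blocks handed out per grab, as suggested above
NUM_THREADS = 4

next_chunk = 0
lock = threading.Lock()
carries = [0] * NUM_BLOCKS   # one stored carry per block, propagated later

def process_block(i):
    # stand-in for the real per-block work; returns that block's carry
    return i % 3

def worker():
    global next_chunk
    while True:
        with lock:                      # atomically take the first free bunch
            start = next_chunk
            next_chunk += CHUNK
        if start >= NUM_BLOCKS:
            return
        for i in range(start, min(start + CHUNK, NUM_BLOCKS)):
            carries[i] = process_block(i)

threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Sequential pass over the stored carries, in block order.
total = 0
for i in range(NUM_BLOCKS):
    total += carries[i]
print(total)  # prints 63
```

The drawback asked about shows up here directly: one carry per block must be held in memory until the sequential pass, so smaller bunches mean more stored carries and an extra pass over the data, traded against better load balance when threads outnumber free CPUs.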

_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime