George Woltman wrote: > At 12:52 AM 9/6/2006, Jason wrote: > > >> | | |****************| | | |****************| >> |c| | block i after | propagate |c| | block i+1 after| >> |a| | itwiddle, IFFT | carries |a| | itwiddle, IFFT | >> |r|--> | and IDWT | for ea --> |r|--> | and IDWT | --> ... >> |r| | weights | row |r| | weights | >> |y| |****************| |y| |****************| >> >> The point is that carries can be propagated while each block is being >> processed, until you finish the last block. Then all the carries have >> to "carriage return" to the next row, the last carry wraps around to >> the first row like before, and the carry vector propagates 10 elements >> into the first block >> > > This is the scheme prime95 currently uses - and as you noted it introduces > a data dependence. In pass 2, the blocks are truly independent and I can > dole a data block to each thread as needed. Very easy and efficient. > In pass 1, with the data dependence introduced by carry propagation, > I can see two choices (as an example say there are 1024 data blocks > and 8 threads): > > 1) Give each thread a contiguous set of data blocks to crunch through. > Thread 1 gets blocks 0 - 127, thread 2 gets 128 - 255, etc. Two major > disadvantages: a) if one CPU core is busy with something else then > 7 threads finish and 6 then sit idle waiting for the 8th thread to finish. > b) more memory for carry buffers and more overhead propagating final > carries. > > 2) Have the 8 threads start on data blocks 0-7. Put a lock around the > carry code such that it insures that data blocks are processed in order. > The theory is that all 8 threads race to the carry code, but then the lock > "spaces them apart" such that thread 1 does carries for block 0 and starts > data block 9, then thread 2 starts carries for data block 1 and then starts > data block 10. > The major negative for this method is that there is a limit to the number of > threads that can be kept busy. If the code is spending 3 units of time in > the FFT code and 1 unit of time in the carry code, then only 4 threads > can be kept busy this way. > > I lean toward the second solution. Does anyone have any real-world > experiences or alternative suggestions? >
How bout if you divide up the passes into a LARGE number of blocks, many more than there are threads, say 100 blocks of work for 4 threads on 4 processors then as eash thread finishes a block you give it another block to do. If one thread is starved for cycles it might only do one block but you only wait 1/100th as long to start the next pass. This would somewhat resemble some wide area distributed processing algorithms. This would of course be more overhead for bookkeeping in the "main" task which hands out blocks of work and keeps track of which ones are done but it reduces the waits for completion on unbalanced threads. Its a classic tradeoff. It may need some tuning out in the wilds to find the optimum numbers of blocks per task and so on. _______________________________________________ Prime mailing list [email protected] http://hogranch.com/mailman/listinfo/prime
