On 4/17/06, Lourens Veen <[EMAIL PROTECTED]> wrote:
>
> You can still get high throughput with pipelined functional units. It
> doesn't matter much if it takes ten cycles to multiply two numbers (or
> vectors of numbers), as long as you can provide two new numbers to
> multiply every cycle, and read out the result of the calculation that
> started ten cycles ago. Throughput will still be ok (or at least as
> good as it gets at the given clock rate).
>
One thing we're forgetting is that static scheduling is way behind the curve, while dynamic scheduling requires lots of extra hardware. Unless we hand-code most of what we run on this, or have some massive peephole optimizer library, we're always going to get sub-optimal code.

The only way to keep the computing units busy with a new fragment every cycle is to avoid data dependency hazards. We can only do that if we overlap the processing of different fragments (like threads), and then we have to keep track of multiple processor states.

Only slightly related: the statistics I have on branch delay slots say that they're fillable only about 60% of the time, and that when they are filled they're useful to the computation only about 80% of the time. That makes delay slots useful only about 50% of the time (0.6 x 0.8 = 0.48).
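
To put rough numbers on the fragment-overlap point, here is a toy model, not a description of any actual design. It assumes a single pipelined unit that can accept one operation per cycle, a fixed result latency, and fragments whose operations each depend on that fragment's previous result. The function name utilization and its parameters are made up for this illustration.

```python
def utilization(latency, fragments, ops_per_fragment):
    """Fraction of cycles in which the unit is handed a new operation.

    Toy assumptions: one issue slot per cycle, every operation of a
    fragment depends on that fragment's previous result, and fragments
    are picked round-robin among those whose operand is ready.
    """
    ready_at = [0] * fragments                  # cycle when each fragment's last result arrives
    remaining = [ops_per_fragment] * fragments  # operations still to issue per fragment
    total = fragments * ops_per_fragment
    issued = 0
    cycle = 0
    start = 0                                   # round-robin starting point
    while issued < total:
        for i in range(fragments):
            f = (start + i) % fragments
            if remaining[f] > 0 and ready_at[f] <= cycle:
                ready_at[f] = cycle + latency   # result comes back `latency` cycles later
                remaining[f] -= 1
                issued += 1
                start = (f + 1) % fragments
                break                           # only one issue per cycle
        cycle += 1                              # idle cycle if nothing was ready
    return issued / cycle


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, "fragments:", round(utilization(10, n, 100), 3))
```

Under these assumptions, a ten-cycle unit fed from a single fragment issues roughly once every ten cycles, while ten interleaved fragments keep it issuing every cycle. The cost is the one named above: ten sets of per-fragment state to track.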
