I think one thing we're missing here is that we are designing a GPU, not a full-blown CPU. Has anyone here ever written a shader before?
Here's my vote: keep it simple. If you go with CISC or anything with a long pipeline, you're going to have problems with data dependencies. A MISC design would need only two, maybe three, pipeline stages: fetch and execute, with perhaps a decode stage. Data dependencies would not be an issue, and it would be a blast writing a compiler for this sort of GPU; you could optimize the shaders to death.

We have to stick with what is practical and what will work well. Plus, we are limited by two restrictions: a low clock rate (200-300 MHz?) and a small transistor budget. Whatever we make must fit within both.

I do have a question, though. Does the GPU in the current OGP design have direct access to video memory, or does it go through a memory controller of sorts? If we could somehow give the GPU direct access to video memory (basically 64 MB of registers), we would have a design with some powerful performance benefits. We could then design the MISC modules to accept memory locations, so you could say, "multiply 0x0004 with 0x01004, placing the result in 0x02004, executing it 0x0010 times."

We find ourselves in a catch-22 here. I'm afraid a RISC design is not going to be fast enough; we would be trying to push too many instructions through the chip. A CISC design is not going to be much better. We cannot go with out-of-order execution because of its complexity, but performance is going to suffer unless we can execute more than one instruction at a time.

What someone said here was right, though: we won't know how it works until we start trying to program it. That's the wonderful thing about OGP, right? When the first prototypes come out, those of us who feel like it can program our own GPU.
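To make the memory-to-memory idea concrete, here is a minimal sketch of how such an instruction could behave, written as a tiny interpreter. The function name, operand order, and unit-stride addressing are all invented for illustration; nothing like this exists in the current OGP design.

```python
# Hypothetical memory-to-memory "MISC" operation: operands are video-memory
# addresses rather than register numbers, and a repeat count strides
# through memory. Purely illustrative; the encoding is made up.

def repeat_mul(mem, src_a, src_b, dst, count):
    """mem[dst+i] = mem[src_a+i] * mem[src_b+i], for i in 0..count-1."""
    for i in range(count):
        mem[dst + i] = mem[src_a + i] * mem[src_b + i]

# "multiply 0x0004 with 0x01004 placing result in 0x02004,
#  executing it 0x0010 times":
mem = {}
for i in range(0x0010):
    mem[0x0004 + i] = i      # first operand vector
    mem[0x01004 + i] = 3     # second operand vector

repeat_mul(mem, 0x0004, 0x01004, 0x02004, 0x0010)
```

One attraction of this style is that a single fetched instruction keeps the datapath busy for many cycles, which matters at a 200-300 MHz clock.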
Timothy

On 4/17/06, Timothy Miller <[EMAIL PROTECTED]> wrote:
> On 4/17/06, Lourens Veen <[EMAIL PROTECTED]> wrote:
> >
> > You can still get high throughput with pipelined functional units. It
> > doesn't matter much if it takes ten cycles to multiply two numbers (or
> > vectors of numbers), as long as you can provide two new numbers to
> > multiply every cycle, and read out the result of the calculation that
> > started ten cycles ago. Throughput will still be ok (or at least as
> > good as it gets at the given clock rate).
>
> One of the things we're forgetting is that static scheduling is way
> behind the curve, but dynamic scheduling requires lots of extra
> hardware. Unless we hand-code most of what we run on this or have
> some massive peephole optimizer library, we're always going to get
> sub-optimal code.
>
> The only way to keep the computing units busy with a new fragment
> every cycle is to avoid data dependency hazards. We can only do that
> if we can overlap the processing for different fragments (like
> threads). Then we have to keep track of multiple processor states.
>
> Only slightly related, the statistics I have on branch delay slots say
> that they're only fillable about 60% of the time and they're only
> useful to the computation about 80% of the time when they're filled,
> making delay slots only useful about 50% of the time.

--
I think computer viruses should count as life. I think it says something
about human nature that the only form of life we have created so far is
purely destructive. We've created life in our own image.
(Stephen Hawking)
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
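The pipelined-throughput point from the quoted exchange is easy to check with a back-of-the-envelope model. The sketch below assumes an idealized, fully pipelined multiplier with a 10-cycle latency; the numbers describe no real hardware, only the argument.

```python
# Idealized cycle counts for a fully pipelined multiplier: 10 cycles of
# latency, but able to accept a new independent operation every cycle.

LATENCY = 10  # cycles from issue to result for one multiply

def cycles(n_ops, independent):
    """Total cycles to complete n_ops multiplies."""
    if independent:
        # Fill the pipe once, then retire one result per cycle.
        return LATENCY + (n_ops - 1)
    # Each op needs the previous result, so the pipeline drains every time.
    return n_ops * LATENCY

print(cycles(100, independent=True))   # 109
print(cycles(100, independent=False))  # 1000
```

Overlapping different fragments "like threads" is exactly what supplies those independent operations. The delay-slot figures quoted also check out as straightforward arithmetic: 60% fillable times 80% useful is 0.6 x 0.8 = 0.48, i.e. slots do useful work roughly half the time.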
