On 4/17/06, Timothy Baldridge <[EMAIL PROTECTED]> wrote:

> A MISC design is going to need two, maybe three, stages in the pipeline:
> fetch and execute, maybe decode, but maybe not. Data dependency is
> not going to be an issue. It would be a blast programming a compiler
> for this sort of GPU; you could optimize the shaders to death.
Yes, since you are going to compile custom code for each revision of OGA, you can do all the scheduling in the compiler. This will increase code size (more NOPs), but it simplifies the hardware by not requiring us to implement any interlocks.

> I do have a question, though. Does the GPU on the current OGP design
> have direct access to the memory? Or does it contact the video memory
> through a memory controller of sorts?

You cannot contact memory without some sort of memory controller. It has to manage banks, row misses, refresh, etc. There is no such thing as "direct access to memory." Our memory controller, however, is our own state machine that we can design as we like.

> If, somehow, we could give the GPU direct access to video memory
> (basically 64MB of registers), then we would have a design that would
> give some powerful performance benefits. We could then design the MISC
> modules to accept memory locations. So you could say, "multiply 0x0004
> with 0x01004, placing the result in 0x02004, executing it 0x0010 times."

As it turns out, we have an odd case where the memory is at least as fast as the logic we can afford to control it with. Modern processors use lots of registers because memory is a horrible bottleneck. Our problem here is that although memory is relatively fast, there is still a significant latency between the request and the receipt of read data. Worse, that latency is variable (row misses incur extra delays) and non-deterministic (memory refreshes appear random to the compute engine).

The lesson I learned long ago with memory is to do as much batching as possible. Read requests get queued, as do the responses. That means the GPU has to be designed to absorb the latency; OGA has a FIFO in the pipeline that sits between the request and receipt stages for exactly that purpose. Writes are queued and forgotten.
For performance, it's important that reads and writes all be allowed to complete out of order, so that you can perform all the accesses for one row before incurring the penalty of switching to another. A sort of "memory barrier" is used to sync everything up when you need to read back what you just wrote (fortunately a rare event, in fixed-function pipelines at least).

The MISC approach is interesting, but all you're really doing is encoding part of the opcode into the register number. Six of one, half a dozen of the other. If it saves you something, do it; I just don't think it does. (With TROZ, I encoded the rendering command into the address, reducing the number of bus cycles necessary to initiate drawing.)

Still, there are plenty of things we could gain by this approach, including the ability to schedule instructions more flexibly. One of the benefits of load-store RISC designs is scheduling freedom: an instruction that adds two memory words and writes the result back to memory, as in a CISC design, would suffer horribly from memory latency, whereas in a RISC design the individual load, ALU, and store instructions can be scheduled (statically or dynamically) so that they are spread out and interleaved with other parts of the algorithm, reducing wasted time and keeping throughput high. The MISC idea is something we should keep in mind when deciding between an opcode and a special-purpose register for a given operation.

> We find ourselves in a catch-22 here. I'm afraid that a RISC design is
> not going to be fast enough. We'll be trying to push too many
> instructions through the chip too fast. However, a CISC design is not
> going to be much better. We cannot go with out-of-order execution
> because of the complexity. But performance is going to suffer unless
> we can execute more than one instruction at a time.

Don't get too caught up in what people tend to mean by CISC and RISC.
The MIPS processor has a SQRT instruction that takes over 100 cycles and incurs all sorts of structural hazards (resource contention) against other FP operations. Our goal should be to design a straightforward, simplified instruction set. That means if we want to sparsely populate an instruction word, we should do it. And we should do all sorts of clever things that implement multiple types of operations with the same instruction (like how some MIPS opcodes are really aliases for others where R0 is implicit; NOP, for instance, is really just SLL R0, R0, 0).

Sorry. I have MIPS on the brain.

> But what someone said here was right. We won't know how it works until
> we start trying to program it. That's the wonderful thing about OGP,
> right? So when we get the first prototypes out, those of us who feel
> like it can program our own GPU on it.

Yes, experimentation (and some simulation) is what will tell us what we want to know.

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
