On 4/17/06, Timothy Baldridge <[EMAIL PROTECTED]> wrote:

>
> A MISC design is going to need two maybe three stages in the pipeline.
> Fetch, and Execute, maybe decode, but maybe not. Data dependency is
> not going to be an issue. It would be a blast programming a compiler
> for this sort of GPU, you could optimize the shaders to death.

Yes, since you are going to compile custom for each revision of OGA,
you can do all the scheduling in the compiler.  This will increase
code size (more NOOPs), but it simplifies the hardware by not
requiring us to implement any interlocks.
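As a toy illustration of compiler-side scheduling (Python; the 2-cycle result latency and the instruction format are made up for the example, not OGA's real pipeline), a scheduler that pads with NOPs rather than relying on hardware interlocks might look like this:

```python
# Toy static scheduler: insert NOPs so a consumer never issues before its
# producer's result is available. The 2-cycle latency is a placeholder for
# whatever the real pipeline needs; no hardware interlocks are assumed.
LATENCY = 2  # cycles from issue until a result may be consumed

def schedule(program):
    """program: list of (dest, srcs) tuples. Returns the NOP-padded list."""
    out = []
    ready = {}  # register -> earliest slot at which its value is usable
    for dest, srcs in program:
        # Pad with explicit NOPs until all source operands are ready.
        while any(ready.get(s, 0) > len(out) for s in srcs):
            out.append(("nop", ()))
        out.append((dest, srcs))
        ready[dest] = len(out) + LATENCY - 1
    return out

prog = [("r1", ()),          # produce r1
        ("r2", ("r1",)),     # depends on r1, so padding is needed
        ("r3", ())]          # independent
padded = schedule(prog)      # -> r1, nop, r2, r3
```

Note how the NOP inflates code size exactly as described above; a smarter scheduler would hoist the independent r3 instruction into that slot instead.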


> I do have a question, though: does the GPU on the current OGP design
> have direct access to the memory? Or does it contact the video memory
> through a memory controller of sorts.

You cannot contact memory without some sort of memory controller. 
It's got to manage banks, row misses, refresh, etc.  There is no such
thing as "direct access to memory."  Our memory controller, however,
is our own state machine that we can design as we like.

>
> If, somehow we could give the GPU direct access to video memory,
> basically 64MB of registers. Then we would have a design that would
> give some powerful performance benefits. We could then design the MISC
> modules to accept memory locations. So you could say, "multiply 0x0004
> with 0x01004 placing result in 0x02004 executing it 0x0010 times.".

As it turns out, we have an odd case where the memory is at least as
fast as the logic we can afford to control it with.  Modern processors
use lots of registers because memory is a horrible bottleneck.  Our
problem here is that although memory is relatively fast, there's still
a significant latency between request and receipt of read data.  Plus
it's variable (row misses incur extra delays) and non-deterministic
(memory refreshes appear random to the compute engine).

The lesson I learned long ago with memory is to do as much batching as
possible.  Read requests get queued, as do the responses.  That means
the GPU has to be designed to absorb the latency.  OGA has a fifo in
the pipeline that sits between request and receipt stages just for
that purpose.  Writes are queued and forgotten.  For performance, it's
important that reads and writes all be allowed to complete out of
order so that you can perform all accesses for one row before
incurring the penalty to switch to another one.  A sort of "memory
barrier" is used to sync everything up when you need to read what you
just wrote (fortunately a rare event in fixed-function pipelines, at
least).
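Here is a sketch of the payoff from completing accesses out of order, grouped by row (Python; ROW_SWITCH, CAS, and the row-bit split are illustrative numbers, not our controller's actual timing):

```python
# Sketch: group queued DRAM requests by row before issuing them.
# Timing constants are illustrative, not real OGA parameters.
ROW_SWITCH = 6   # extra cycles to close one row and open another
CAS = 2          # cycles for an access that hits the already-open row

def cost(addresses, row_bits=10):
    """Total cycles to service the requests in the given order."""
    cycles, open_row = 0, None
    for a in addresses:
        row = a >> row_bits
        if row != open_row:          # row miss: pay the switch penalty
            cycles += ROW_SWITCH
            open_row = row
        cycles += CAS
    return cycles

# Interleaved accesses to two rows, as a naive in-order queue would issue them.
queue = [0x0000, 0x8000, 0x0004, 0x8004, 0x0008, 0x8008]
naive = cost(queue)                                   # ping-pongs between rows
batched = cost(sorted(queue, key=lambda a: a >> 10))  # one row fully, then the next
```

With these made-up numbers the naive order pays the row-switch penalty on every access (48 cycles) while the batched order pays it only twice (24 cycles); the memory barrier is what lets you get away with this reordering safely.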

The MISC approach is interesting, but all you're really doing is
encoding part of the opcode into the register number.  Six of one,
half a dozen of the other.  If it saves you something, do it.  But I
don't think it does.  (With TROZ, I encoded the rendering command into
the address, reducing the number of bus cycles necessary to initiate
drawing.)  Still, there are also plenty of things we could gain by
this approach, including the ability to schedule instructions more
flexibly.  One of the benefits of load-store RISC designs is that an
operation which, in a CISC design, would add two memory words and
write the result back to memory (suffering horribly from memory
latency) becomes separate load, ALU, and store instructions; those can
be scheduled (statically or dynamically) so that they are spread out
and interleaved with other parts of an algorithm, reducing the wasted
time and keeping throughput high.  The MISC idea is something we should
keep in mind when deciding between opcode and special purpose register
for a given operation.
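A rough cycle-count model of that point (Python; the 3-cycle load latency and one-issue-per-cycle assumption are made up for illustration):

```python
# Compare a back-to-back load/add/store sequence against the same sequence
# with independent work interleaved. Load latency is an assumed 3 cycles.
LOAD_LATENCY = 3

def cycles(schedule):
    """schedule: list of (op, dest, srcs). Returns total cycles, with stalls."""
    t, ready = 0, {}
    for op, dest, srcs in schedule:
        t = max([t] + [ready.get(s, 0) for s in srcs])  # stall for operands
        t += 1                                          # one issue per cycle
        ready[dest] = t + (LOAD_LATENCY - 1 if op == "load" else 0)
    return t

back_to_back = [("load", "r1", []),
                ("add", "r3", ["r1", "r2"]),   # stalls waiting on the load
                ("store", "m", ["r3"])]
interleaved  = [("load", "r1", []),
                ("add", "x", []),              # independent work fills the
                ("add", "y", []),              # load-latency slots
                ("add", "r3", ["r1", "r2"]),
                ("store", "m", ["r3"])]
```

Both schedules finish at cycle 5 under this model, but the interleaved one retires two extra useful instructions in the same time; the back-to-back version burns those cycles as stalls.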

> We find ourselves in a catch-22 here. I'm afraid that a RISC design is
> not going to be fast enough. We'll be trying to push too many
> instructions through the chip too fast. However, a CISC design is not
> going to be much better. We cannot go with Out-of-Order execution
> because of the complexity. But performance is going to suffer unless
> we can execute more than one instruction at a time.

Don't get too caught up in what people tend to mean by CISC and RISC. 
The MIPS processor has a SQRT instruction that takes over 100 cycles
and incurs all sorts of structural hazards (resource contention)
against other FP operations.  Our goal should be to design a
straightforward, simplified instruction set.  That means if we want
to sparsely populate an instruction word, we should do it.  And we
should do all sorts of clever things that implement multiple types of
operations with the same instruction (like how some MIPS opcodes are
really aliases for others where R0 is implicit).  Sorry.  I have MIPS
on the brain.
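For instance, in the standard MIPS R-type encoding, `nop` is literally the all-zero word (an alias for `sll $0, $0, 0`), and assemblers expand `move` into `addu` with `$zero` as one source. A quick Python check of the field layout:

```python
# Standard MIPS R-type field layout: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6).
def r_type(op, rs, rt, rd, shamt, funct):
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# "nop" is just "sll $0, $0, 0": the all-zero instruction word.
nop = r_type(0, 0, 0, 0, 0, 0x00)          # -> 0x00000000

# "move $t0, $t1" is assembled as "addu $t0, $t1, $zero"
# ($t0 = reg 8, $t1 = reg 9, addu funct = 0x21).
move_t0_t1 = r_type(0, 9, 0, 8, 0, 0x21)   # -> 0x01204021
```

One real opcode serving several apparent instructions, just by wiring R0 into an operand field, is exactly the kind of cleverness meant above.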

> But what someone said here was right. We won't know how it works until
> we start trying to program it. That's the wonderful thing about OGP
> right? So when we get the first prototypes out, those of us who feel
> like it can program our own GPU on it.

Yes, experimentation (and some simulation) is what will tell us what
we want to know.
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)