Let's say we just give each engine 256 registers. (I'd rather it were smaller.) That means that each engine, which can run up to 8 threads (the default for now), needs 2048 registers, or 4 block RAMs, just for the register file. Also, since the icache is shared across four shaders (again, the default for now), each shader requires 4.25 BRAMs.
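The per-shader arithmetic above can be checked with a rough sketch. It assumes 32-bit registers, a block RAM organized as 512 words of 32 bits, and a one-BRAM icache; none of these is stated outright in this message, they are simply the values that reproduce the 4 and 4.25 figures. All function and parameter names are made up for illustration.

```python
from fractions import Fraction

# Assumption: one block RAM holds 512 words of 32 bits.
WORDS_PER_BRAM = 512


def register_file_brams(regs_per_thread=256, threads=8,
                        words_per_bram=WORDS_PER_BRAM):
    """Block RAMs needed to hold every thread's registers (rounded up)."""
    total_regs = regs_per_thread * threads
    return -(-total_regs // words_per_bram)  # integer ceiling division


def brams_per_shader(regs_per_thread=256, threads=8,
                     icache_brams=1, shaders_per_icache=4):
    """Register file plus this shader's share of the shared icache.

    Assumption: the icache occupies one BRAM shared by four shaders.
    """
    regfile = register_file_brams(regs_per_thread, threads)
    return regfile + Fraction(icache_brams, shaders_per_icache)


if __name__ == "__main__":
    print(register_file_brams())        # BRAMs for the register file
    print(float(brams_per_shader()))    # BRAMs per shader including icache share
```

Keeping the register count as a parameter makes it easy to see what a smaller register file would buy: under the same assumptions, 128 registers per thread needs only 2 BRAMs for the register file.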
For sort-first, if we want the global dcache to be the same size as one memory row (which is also arbitrary, because we do expect reads to happen to other memory rows, from other surfaces), and we want to give the texture engine the same amount (why not), then that's four BRAMs for the global dcache.

We're going to need BRAMs for the memory and video system. Currently (IIRC), OGD1 uses small queues for PCI access to memory, but this is because PCI is really slow compared to the memory system. With an engine in there, we'll need some sort of queueing system. There are four memory controllers, and each one has:

- One 64-bit-wide return queue for video head 0
- One 64-bit return queue for video head 1
- One 96-bit write/command queue
- One 64-bit read return queue

That adds up to 9 BRAMs per controller, totaling 36.

I'm assuming that each shader will have one small (distributed RAM) queue for writes and read requests and another for read return. Writes and commands spill from the shader queues into the global one to be processed. We'll have to see how that random logic adds up.

It's clear enough to me that we could probably combine some of the queues with the caches. We (probably) have to keep the video queues, but combining the caches with the read return queues, we end up with basically just double the cache space at no extra cost.

I'm missing things, I'm sure of it, but just going from these numbers, if we have N BRAMs total, then we can fit in ((N-36)/4.5) shaders. For instance, in the Spartan 3 4000, which has 96 BRAMs, we can fit 13 shader engines. (Assuming the logic fits too.) That's not too bad and will allow us to do a lot of scalability testing. Xilinx has some high-end Spartan 6 boards that IIRC are quite expensive, but we might buy one for some extended testing. The largest one has 268 BRAMs, so we can fit 51 shaders.

Of course, all of this assumes that 256 registers is the right number.
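As a rough sketch, the queue and shader-count estimates above can be reproduced in a few lines. It assumes each queue is built from BRAMs used 32 bits wide, which is what makes the four queues come to 9 BRAMs per controller. It follows the formula as written, subtracting only the 36 queue BRAMs and dividing by 4.5. The divisor is a parameter, so the 4.25 per-shader figure from earlier in this message can be tried instead. All names are made up for illustration.

```python
from math import ceil

# Assumption: queues are built from block RAMs used 32 bits wide.
BRAM_WIDTH_BITS = 32

# Queue widths per memory controller, taken from the list above.
QUEUE_WIDTHS_BITS = {
    "video head 0 return": 64,
    "video head 1 return": 64,
    "write/command": 96,
    "read return": 64,
}


def queue_brams_per_controller(widths=QUEUE_WIDTHS_BITS,
                               bram_width=BRAM_WIDTH_BITS):
    """BRAMs needed for one controller's queues, one column per 32 bits."""
    return sum(ceil(w / bram_width) for w in widths.values())


def max_shaders(total_brams, controllers=4, brams_per_shader=4.5):
    """Shaders that fit after the controller queues are paid for."""
    fixed = controllers * queue_brams_per_controller()
    remaining = max(0, total_brams - fixed)
    return int(remaining // brams_per_shader)


if __name__ == "__main__":
    print(queue_brams_per_controller())   # BRAMs per controller
    for name, brams in [("Spartan 3 4000", 96), ("largest Spartan 6", 268)]:
        print(name, max_shaders(brams))
```

This only counts block RAMs. Whether the logic for that many shaders fits is a separate question that the sketch does not attempt to answer.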
On Tue, Oct 6, 2009 at 5:53 PM, Hugh Fisher wrote:
> Timothy Normand Miller wrote:
>>>
>>> The OpenGL spec requires at least 16 4-way vector attributes for
>>> vertex shaders, and at least 32 4-way vector varying values for
>>> fragment shaders.
>>
>> So, 128 regs, just for arguments.
>
> My bad. That's not a simultaneous load, since the vertex and fragment
> shaders don't need to run at the same time. (And in a sort-to-tiles
> architecture which you seem to be suggesting, all the vertex shading
> has to be done before fragment shading.)
>
> So the max number of incoming arguments is 32 for fragment shaders.
> The number of outgoing arguments is <= incoming for vertex shaders,
> and much < incoming for fragment shaders.
>
> And again assuming the target market is not high end gaming or HPC,
> you could easily aim for 8 arguments in registers and the rest
> passed in slower memory, like the MIPS calling conventions.
>
>>> For the fixed function pipeline the vertex shader is the more complex
>>> one, needing 4 argument vectors and enough working registers for a
>>> full matrix x vector transform and Gouraud lighting equation with one
>>> light source.
>>
>> Even in this case, I'm not sure how many scalars it translates into.
>
> There are sample GLSL shaders around that emulate the standard fixed
> function pipeline. I'll go through the code of one and try to figure
> this out.
>
> --
> Hugh Fisher
> CECS, ANU
>

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
