On Tue, Sep 22, 2009 at 9:03 PM, Andre Pouliot <[email protected]> wrote:
> With the current architecture proposed the load would be the same. Since one
> scalar instruction would be done on at least 4 threads at once.
>
> The architecture would run many kernel(shader programs) at once, on multiple
> threads. So while fetching one instruction your executing 4 threads at once.
> It may seem counter intuitive but the shader will be executing a set of m
> (kernel) * n (threads) at once. The kernel(program) being executed one after
> another. The threads being the dataset to process. Since most threads need
> to execute the same program they are being processed and controlled by the
> same kernel.

I think we need to drop this mention of MIMD and move the explanation to a
later section.  Essentially, what we're proposing here is a scalar
architecture.  As an afterthought, we recognize that almost all fragments
will use the same kernel and follow the same instruction sequence.  We can
take advantage of this to optimize the amount of SRAM used for the
instruction store/cache: we group N tasks together to share the same
instruction store.  As long as they're all fetching the same instruction, we
only need a single port on the RAM; we send that same instruction down N
independent execution pipelines.  This is nothing more than an AREA
optimization, albeit quite a significant one.  (See the first sketch
appended below.)

> Doing so reduce the need of data communication between the different ALU and
> reduce the dead time present. One vector operation on 1 thread will take
> more time to execute than having a SIMD standard unit, But when executing
> multiple thread controlled by the same kernel the overall time to execute a
> vector operation will be reduced.

Another optimization we make, much more fundamental than MIMD, is that we
execute several tasks round-robin on each pipeline, at instruction
granularity.  (If our MIMD is N wide, and we round-robin M tasks on each,
that's N*M total tasks assigned to the block.)  This is like Niagara.  We
eliminate the need for interlocks or NOOP instructions to deal with
instruction dependencies; in fact, the only instructions that take long
enough to actually stall on a dependency are memory reads and divides.  We
also get some of the benefits of an OOO engine (although not superscalar)
without having to add any special hardware to support it.  (See the second
sketch appended below.)

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
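
To make the shared-instruction-store idea concrete, here is a rough C sketch
of the fetch side.  It is only an illustration of the area optimization, not
the actual design: NUM_PIPES, struct pipe, istore, and fetch_cycle are
made-up names, and a real block would arbitrate when program counters
diverge rather than simply skipping pipelines as this toy loop does.

    /* Sketch only: N pipelines fed from one single-port instruction SRAM.
     * One fetch per cycle is broadcast to every pipeline whose PC matches. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_PIPES   4      /* N pipelines grouped on one instruction store */
    #define STORE_WORDS 256    /* size of the shared instruction SRAM          */

    struct pipe {
        uint32_t pc;           /* each pipeline keeps its own program counter  */
    };

    static uint32_t istore[STORE_WORDS];   /* single-port instruction SRAM     */

    /* One cycle of the shared front end: read the SRAM once, then hand the
     * same instruction to every pipeline that wants that address.  Pipelines
     * that have diverged just sit out this cycle in this sketch.             */
    static void fetch_cycle(struct pipe pipes[NUM_PIPES])
    {
        uint32_t addr = pipes[0].pc;               /* one read per cycle      */
        uint32_t insn = istore[addr % STORE_WORDS];

        for (int i = 0; i < NUM_PIPES; i++) {
            if (pipes[i].pc == addr) {             /* same kernel, same insn  */
                printf("pipe %d executes insn %08x from %u\n",
                       i, (unsigned)insn, (unsigned)addr);
                pipes[i].pc++;
            }
        }
    }

    int main(void)
    {
        struct pipe pipes[NUM_PIPES] = {0};
        for (int cycle = 0; cycle < 4; cycle++)
            fetch_cycle(pipes);
        return 0;
    }

The point of the sketch is that the broadcast in the inner loop costs wires,
not a second SRAM port, which is where the area saving comes from.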

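A similarly rough sketch of the round-robin (Niagara-style) idea: M tasks are
interleaved on one pipeline at instruction granularity, so an outstanding
memory read or divide has completed by the time its task comes around again,
with no interlock or NOOP needed.  TASKS_PER_PIPE, MEM_LATENCY, and the
"every fifth instruction is a load" rule are arbitrary assumptions chosen
only to show the scheduling pattern, not numbers from the design.

    /* Sketch only: barrel-style round-robin of M tasks on one pipeline. */
    #include <stdio.h>

    #define TASKS_PER_PIPE 4   /* M tasks interleaved on one pipeline          */
    #define MEM_LATENCY    3   /* cycles before a memory read / divide is done */

    struct task {
        int pc;                /* per-task program counter                     */
        int busy_until;        /* cycle at which an outstanding op completes   */
    };

    int main(void)
    {
        struct task tasks[TASKS_PER_PIPE] = {0};

        for (int cycle = 0; cycle < 16; cycle++) {
            int id = cycle % TASKS_PER_PIPE;       /* round-robin issue slot   */
            struct task *t = &tasks[id];

            if (cycle < t->busy_until) {
                /* Only long ops (memory reads, divides) can still be pending;
                 * with M tasks in flight this branch essentially never fires. */
                printf("cycle %2d: task %d waiting\n", cycle, id);
                continue;
            }

            printf("cycle %2d: task %d issues insn %d\n", cycle, id, t->pc);

            if (t->pc % 5 == 0)            /* pretend every 5th insn is a load */
                t->busy_until = cycle + MEM_LATENCY;
            t->pc++;
        }
        return 0;
    }

With MEM_LATENCY (3) smaller than TASKS_PER_PIPE (4), the "waiting" branch
never triggers: by the time a task is issued again its load has returned,
which is exactly the latency-hiding effect described in the message above.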