On Tue, Sep 22, 2009 at 10:14 PM, Hugh Fisher wrote:
> Andre Pouliot wrote:
>>
>> Yes all the vector ops will be emitted as scalar ops. The program do get
>> longer but scalar ops can be made shorter since we have less instruction
>> to support.
>
> SIMD instructions are the same length as scalar instructions on any
> sensible CPU: see MIPS and PowerPC. Four scalar MADD instructions
> are going to be four times longer than the equivalent vector MADD.
>
> (Uh, you *are* intending to use fixed length instructions, right?
> Please tell me you're not thinking of variable length opcodes?)

No. Fixed-length opcodes.

There are two benefits to having vector ops. One is throughput. I'm still convinced by my own earlier argument about aggregate parallelism for a given die area. I'd like to see a counterargument.

The other is code size. I don't know how many different kernels will typically be needed for a single scene. If it's a lot, then smaller code size will definitely help with instruction cache misses. If it's only a couple of kernels, but they're long, same thing. I need more info on this.

>
>> We are doing only one fetch to execute on multiple data since most
>> of the data is controlled by the same program, we parallelize the data set
>> but we consider each result independently from the others executed at the
>> same time(different threads). The organisation of memory would be
>> essentially the same between a SIMD or our current architecture. Both
>> require 256 bits memory access for a add operation and a 128 bits memory
>> write. Control is also the same the FPGA don't allow memory wider than 32
>> bits port access with a single memory block. Because of those requirement
>> either the current architecture or a SIMD one would require 2 memory block
>> by ALU. The connection is mostly wire no read ahead for the data.
>
> OK, I see the point. But won't a SIMD design be much easier to
> speed up when the port width increases to 64/128 bits in a future
> version?

What about keeping the same port width and just adding more cores?

>> Those optimization were to improve 3D rendering and scientific processing
>> on a general purpose processor. You don't have the same requirement and
>> workload as a GPU. Different problem and context require different
>> solution.
>
> It's exactly the same requirements and workload! 3D vertices have to be
> multiplied by a 4x4 transform matrix. Doesn't matter whether it's on the
> CPU or GPU.

If the number of vertexes is small, more single-thread performance will make a difference, but shading all vertexes will then take proportionally less of the total rendering time. If the number of vertexes is large, then we can exploit the available parallelism and will likely saturate the memory bandwidth anyhow.

So we come back to having only a code size argument. Shrinking the code by a factor of 3 could be a HUGE win.

-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
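
As a rough illustration of the code-size argument in this thread, the sketch below counts the fixed-length instructions needed to multiply one vertex by a 4x4 matrix, first with scalar ops only and then with 4-wide vector ops. The op names (MUL, MADD, VMUL, VMADD), the register names, the lane-broadcast operand, and the 32-bit instruction word are all assumptions made for the example; none of them come from the actual Open Graphics ISA.

```python
# Sketch only: op names (MUL, MADD, VMUL, VMADD), register names, and the
# 32-bit instruction word are assumptions, not the project's actual ISA.

INSTR_BYTES = 4  # assumed fixed-length 32-bit opcode


def scalar_transform_program():
    """Scalar-only code for out = M * v, one component at a time."""
    prog = []
    for r in range(4):
        # out[r] = m[r][0] * v[0]
        prog.append(("MUL", f"out{r}", f"m{r}0", "v0"))
        for c in range(1, 4):
            # out[r] = m[r][c] * v[c] + out[r]
            prog.append(("MADD", f"out{r}", f"m{r}{c}", f"v{c}", f"out{r}"))
    return prog


def vector_transform_program():
    """4-wide vector code for the same transform, one matrix column per op."""
    # out = col0 * v.xxxx
    prog = [("VMUL", "out", "col0", "v.xxxx")]
    for c, lane in zip(range(1, 4), "yzw"):
        # out = col[c] * (v.<lane> broadcast to all four lanes) + out
        prog.append(("VMADD", "out", f"col{c}", f"v.{lane * 4}", "out"))
    return prog


def code_bytes(prog):
    """Code size in bytes under the fixed-length assumption."""
    return len(prog) * INSTR_BYTES


if __name__ == "__main__":
    s = scalar_transform_program()
    v = vector_transform_program()
    print(len(s), code_bytes(s))  # 16 64
    print(len(v), code_bytes(v))  # 4 16
```

For this one kernel the vector form is four times shorter, which matches the "four scalar MADDs per vector MADD" point. Real shaders mix scalar and 3-component work with 4-component work, so the overall shrink would presumably be smaller than 4x.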
