Step one, of course, is to develop a preliminary GPU architecture to build a simulator of. We have something to start with, but the architecture should remain in flux, because the whole point of the simulator is to let us understand the impact of architectural design decisions.
I have a friend who is currently doing research on GPUs, and he could really benefit RIGHT NOW from a detailed and accurate GPU simulator. I could use it eventually for power modeling, whereas his need is more reliability-oriented. If our long-term goal is a fabbable GPU design, then we're going to need a simulator, both before and after we have a complete design. For the OGP, it would be the basis for making architectural decisions. But if we make it somewhat general and parameterized, it could be used around the world by people doing research on GPU architecture.

FOSS means different things to different people, but those things generally include freedom, usefulness, impact on the world, and mindshare. By setting up a realistic SOFTWARE goal for this project, we can contribute to all of those, without the sort of monetary expense that posed too much of a challenge the first time around. Even if a useful GPU simulator is as far as we get, we'll have done some good for the world, and we'll be a success (arguably for a second time; then we can make it a third).

So, let's talk engineering now. I'll assert (in order to generate discussion) that a GPU needs to be divided into the following components:

- A geometry processor *
- A vertex processor *
- A rasterizer (or set thereof)
- A fragment processor *
- A process scheduler (to distribute work to compute engines)
- A texture processor (mostly for decompressing pre-computed textures)
- A memory system for storing drawing surfaces and textures (possibly with some amount of caching)

* Composed of an array of compute engines.

Whether or not the compute engines for each task are all identical (and/or dynamically partitioned) is a design parameter (i.e. an architectural characteristic to be examined). Whether the compute engines are vector or scalar is also a design parameter. (I argue that scalar designs will get better internal utilization and allow more engines to fit into the same die area. A counterargument to consider is that with too many engines, we may be unable to utilize them all, i.e. poor external utilization.)

Larrabee demonstrated that rasterization is challenging for a general-purpose processor, suggesting that we may want dedicated hardware for it. Notably, the only specialized hardware in Larrabee was a texture engine, because they knew up front that texturing was too inefficient on a general-purpose processor.

In general, GPUs perform identical sets of operations on different data (neighboring pixels), and only under limited circumstances do you get divergence (of flow control). So thread-level parallelism is lined up into WARPS, where neighboring processing pipelines, THREAD PROCESSORS, execute the same instructions, which in practical terms means they all share the same instruction cache.

There are actually two dimensions to warps. The horizontal dimension is spatial, corresponding to parallel execution pipelines (like SMP). The vertical dimension is temporal, where instructions from different threads are multiplexed down the same pipeline (like SMT). Some GPUs, as I understand it, switch threads vertically only on memory stalls. For OGA2, the design spec was round-robin issue. Moreover, the spec was to make the vertical warp size exceed the pipeline depth, eliminating all pipeline hazards (ALU, branch, etc.). We had come up with a 9-stage pipeline but needed to make the warp depth 10 for register-file multiplexing reasons (IIRC). With a width of 4 horizontally, this would make the warp size 40 for OGA2. (32 is common for mainstream GPUs, so we're not very far off.)
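To make that concrete, here's a minimal C++ sketch of how warp geometry and round-robin vertical issue might be parameterized in a functional simulator. The names (WarpGeometry, ThreadSlot, Warp) and the trivial per-cycle behavior are illustrative assumptions on my part, not anything from the OGA2 spec beyond the 4-wide, 10-deep, round-robin description above:

    #include <cstddef>
    #include <vector>

    struct WarpGeometry {
        std::size_t width;   // horizontal: parallel thread processors (spatial, SMP-like)
        std::size_t depth;   // vertical: thread slots multiplexed down one pipeline (temporal, SMT-like)
        std::size_t size() const { return width * depth; }   // 4 * 10 = 40 for OGA2
    };

    struct ThreadSlot {
        std::size_t pc = 0;   // stand-in for real per-thread state
    };

    struct Warp {
        WarpGeometry geom;
        std::vector<ThreadSlot> slots;   // geom.size() thread slots
        std::size_t row = 0;             // which vertical slot issues this cycle

        explicit Warp(WarpGeometry g) : geom(g), slots(g.size()) {}

        // One simulated cycle: every thread processor (column) issues one
        // instruction from the current row, then we advance round-robin.
        // With warp depth greater than pipeline depth, a thread's previous
        // instruction has retired before that thread issues again, so there
        // are no pipeline hazards to model.
        void issue_cycle() {
            for (std::size_t col = 0; col < geom.width; ++col) {
                ThreadSlot& t = slots[row * geom.width + col];
                t.pc += 1;   // placeholder for fetch/decode/execute
            }
            row = (row + 1) % geom.depth;   // round-robin vertical switching
        }
    };

    int main() {
        Warp w(WarpGeometry{4, 10});   // the OGA2 numbers: 4 wide, 10 deep, 40 threads
        for (int cycle = 0; cycle < 100; ++cycle) w.issue_cycle();
    }

The point isn't this particular structure; it's that warp width, warp depth, and the number of engines become ordinary parameters, so the architectural questions above turn into parameter sweeps.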
OGA2 was spec'd to do sort-first rendering, because sort-first has data locality for the target surface, and we therefore spec'd a data cache exactly the size of the rendering block (32x32 pixels). However, most GPUs are sort-last, for scalability reasons. This is, therefore, a design parameter. (One that may even be a function, at instantiation time, of other parameters.)

This is all high-level stuff. My suggestion is to consider pipeline structure only indirectly and start with a purely functional simulator. That is: functional-level shader engines with parameterized warp sizes, a parameterized number of engines, geometry processing, vertex processing, fragment processing, a rasterizer, no texture compression, and a simple memory system.

Although the compute engine is abstract, we're still simulating a parallel system. So rather than reinventing the wheel by developing our own event-based framework or some such, let's take advantage of existing software. I suggest that we write the functional model in BEHAVIORAL VERILOG. (See the suggestion I posted here: http://stackoverflow.com/questions/3999717/discrete-event-simulators-for-c/9350333#9350333) Then we can use the open-source Icarus Verilog simulator to run it. I'm willing to entertain other options for performance reasons, because faster simulators are more useful. (Even better would be to make Icarus itself faster. I doubt it does JIT compilation already, and it would be great to add that. More contributions that are useful to many people.)

With regard to the instruction set, think of it as a data structure, and make it as wide as you want. I don't mean VLIW. I mean that if we need 37 bits initially, let's use 37 bits. Or 49. Or whatever. If our simulator is in C++, then the instruction will actually BE a data structure, even if it corresponds 1:1 with a 32-bit word in a later iteration of the design. What we need is an appropriate set of OPERATIONS, and we'll solidify that into a more concrete instruction set as we explore the implications. (One advantage of doing this in C++ is that if we change the instruction set, much of the simulator doesn't change: the representation changes, but the access methods for the instruction object mostly don't.)
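To illustrate that last point, here's a rough C++ sketch of an instruction as a data structure. The opcode names and operand fields are placeholders I'm inventing for illustration, not a proposed OGA instruction set; what matters is that the rest of the simulator goes through accessors, so the eventual bit-level encoding can change without touching the execution model:

    #include <cstdint>

    // Hypothetical operation set; the real one gets decided as we explore.
    enum class Op : std::uint8_t { Add, Mul, Mad, Load, Store, Branch };

    class Instruction {
    public:
        Instruction(Op op, unsigned dst, unsigned src0, unsigned src1)
            : op_(op), dst_(dst), src0_(src0), src1_(src1) {}

        // The rest of the simulator only sees these accessors. If we later
        // pack the instruction into a 32-bit (or 37-bit, or 49-bit) word,
        // only this class changes; the decode/execute code does not.
        Op op() const { return op_; }
        unsigned dst() const { return dst_; }
        unsigned src0() const { return src0_; }
        unsigned src1() const { return src1_; }

    private:
        Op op_;                       // as wide as we want for now
        unsigned dst_, src0_, src1_;  // register indices; pack them later
    };

A later pass could add an encode()/decode() pair that maps this structure onto whatever concrete bit layout we eventually settle on.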
P.S. I had an IM discussion with Gary Sheppard, and I'd like to clarify some points. I think we have the potential to beat PowerVR, AMD, and nVidia in terms of performance and energy efficiency. ARCHITECTURALLY. If we're clever enough. Real hardware would be made at the behest of a licensee wanting to embed our design in their SoC. We're defining an ARCHITECTURE. We care only peripherally about things like transistor technology. A licensee of our design COULD start from a synthesized and placed design, or they could start from the Verilog source. Not our problem, because we're fabless. The patent-minefield issue isn't our problem either, because the licensee would have to deal with that. The time we care about transistor technology is when we want to prove that our design is more efficient, so we have to synthesize something and simulate it at gate level based on some standard-cell technology. But we don't prescribe it. That's just evidence that we're better. Also, in my imagination, we're targeting embedded systems first and supercomputers second, not necessarily with the exact same architecture, because 3D and GPGPU involve different compromises. One or both will eventually trickle into the desktop.

P.P.S. News headline: "The Open Graphics Project reinvents itself to be completely fabless, with the goal of developing the best GPU architecture, to beat all others in terms of efficiency, performance, and freedom." (I know. I'm being over the top. But I feel like something really impactful is doable now.) There. Now that we don't have finances as a barrier to entry, we can really succeed.

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
