André and Kenneth spent some time working on the OGA2 spec in advance of posting to the list so they could get their thoughts together and make sure to present something reasonably coherent and detailed. One of the things we now want help with is further clarifying our basic intent; after that, we can move on to progressively more specific implementation details.
One of the design details that seems to be hard to present is the MIMD architecture. At first glance, it looks like a SIMD architecture, but all of you are right to point out that shader workloads are primarily scalar. We might include vector instructions if we find them helpful, but they'll just translate into optimized use of the scalar ALUs. We'll have three basic datatypes: float32, int32, and uint32. (int32 and uint32 will mostly be treated identically, except for multiplication, division, and conversions.) All other data types will simply be converted on the way in/out of some other resource. Since we have a successful MIPS-like architecture in HQ, we'll just extend it (conceptually). The MIPS architecture, lacking carry/overflow/etc. flags, is simply easier to implement. The instances where we end up requiring a couple of extra instructions are a worthwhile tradeoff for keeping the active thread context down to just the register file and the program counter.

The programmer's view of the processor is that each fragment of a polygon is processed by a scalar program running on a scalar processor (possibly with some vector optimizations). The compiler and scheduler then see a sea of processors that can execute this program on each fragment in parallel. Each fragment gets its own task (a short-running thread). Since each fragment refers to a different set of colors and texels, each task requires its own register file. But there will only be a handful of unique tasks running at the same time. Much of the time, you'll find thousands of fragments being processed by the same shader program, and they'll mostly be processed by exactly the same instruction sequence. Thus, it makes sense to economize on program file space. One idea is to run dozens of threads off of the same instruction cache, in groups. As an example, we might assign up to 64 threads to an 8-wide MIMD processor.
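To make the grouping idea concrete, here's a minimal Python sketch of the scheduling model (names like ThreadGroup and the exact bookkeeping are my own illustration, not from the spec): 64 threads are split into 8 groups of 8, each group shares one program counter, and one instruction fetch per group feeds all 8 pipelines.

```python
from dataclasses import dataclass

@dataclass
class ThreadGroup:
    pc: int               # shared program counter for the group
    thread_ids: list      # the (up to) 8 threads issuing from this PC

def schedule(groups, n_cycles):
    """Round-robin: each cycle, one group fetches a single instruction
    from the shared instruction cache and issues it to all its lanes."""
    issued = []
    for cycle in range(n_cycles):
        g = groups[cycle % len(groups)]
        issued.append((g.pc, tuple(g.thread_ids)))  # one fetch, 8 issues
        g.pc += 1                                   # whole group advances
    return issued

# 64 threads -> 8 groups of 8, all starting at the same PC.
groups = [ThreadGroup(pc=0, thread_ids=list(range(i * 8, i * 8 + 8)))
          for i in range(8)]
trace = schedule(groups, 8)

# After one full rotation, every group has advanced its PC by exactly one,
# and all 64 threads were issued with only 8 instruction fetches.
assert all(g.pc == 1 for g in groups)
assert len({t for _, ids in trace for t in ids}) == 64
```

The point of the sketch is the economy: eight threads make progress per fetch, so the instruction cache bandwidth scales with the number of groups, not the number of threads.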
Initially, the constraint for the scheduler is that sets of 8 threads should start with the same program counter. The shared program file (or instruction cache), which lets us economize on RAM, fetches an instruction once and issues it to 8 different pipelines. The remaining threads have instructions issued in round-robin fashion, also in groups of 8. Should one or more of those programs diverge (a conditional instruction doesn't agree across threads), the like-minded threads are split off and scheduled separately as one additional cycle in the round-robin.

Memory reads are the worst thing to handle, especially if the read incurs a row miss. Typical GPUs have thousands of active threads: if one thread is stalled waiting on a read, another can take its place. We need to investigate how we can do the same. My personal opinion is that we should optimize for sort-first. Essentially, we section the target surface into rectangular areas, each corresponding to a memory row. This keeps all writes and read-modify-writes of the target surface confined to a single memory row. The nondeterministic nature of scheduling onto this sea of processors means that we cannot expect fragments to be processed in some convenient order, as we could with a fixed-function design. Fortunately, confinement to a memory row helps tremendously with that: we get a higher probability of a cache hit for a read, and a cache miss is relatively cheap compared to a row miss. With sort-first, we would first identify all triangles that intersect the current rectangle and render everything that would be seen there before moving on to the next one. This makes the target surface very cheap to access. Texel reads could be much more random-access, however. Sort-first will help confine them a bit, and we'll have to take care of the rest with caching and texture compression.
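Here's a minimal Python sketch of the sort-first idea (the 1024-pixels-per-row figure is from the OGD1 discussion below; the function names and the bounding-box test are my own illustration): pixels map to 32x32 patches so that one patch's target-surface traffic stays in one memory row, and triangles are coarsely binned to a patch before rendering.

```python
# Assumption carried over from the OGD1 numbers: one DRAM row holds
# 1024 pixels, configured as a 32x32 patch of the target surface.
PATCH_W = PATCH_H = 32

def patch_of(x, y, surface_width):
    """Index of the 32x32 patch (i.e., the memory row) holding pixel (x, y)."""
    patches_per_row = surface_width // PATCH_W
    return (y // PATCH_H) * patches_per_row + (x // PATCH_W)

def bin_triangles(patch_x0, patch_y0, triangles):
    """Coarse sort-first binning: keep triangles whose bounding box
    overlaps the patch; only these are rendered before moving on."""
    x1, y1 = patch_x0 + PATCH_W, patch_y0 + PATCH_H
    hits = []
    for tri in triangles:                      # tri = three (x, y) vertexes
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        if min(xs) < x1 and max(xs) >= patch_x0 \
           and min(ys) < y1 and max(ys) >= patch_y0:
            hits.append(tri)
    return hits

# Every pixel of one 32x32 rectangle maps to the same patch (one DRAM row):
assert patch_of(0, 0, 1024) == patch_of(31, 31, 1024)
assert patch_of(32, 0, 1024) == 1

tris = [((0, 0), (10, 0), (0, 10)),                # overlaps patch (0, 0)
        ((100, 100), (110, 100), (100, 110))]      # does not
assert bin_triangles(0, 0, tris) == [tris[0]]
```

However the fragments inside the patch get scheduled, every one of them lands in the same row, which is exactly the property that makes the nondeterministic ordering tolerable.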
I might be remembering this wrong, but with OGD1 and its four memory controllers, one memory row corresponds to 1024 pixels, which could be configured as a 32x32 patch. We need to decide if that's enough. Currently, the bank selects are configured to be the top bits of the address. This allows, for instance, alternating reads and writes to different areas of graphics memory (not within the same bank) to be relatively cheap, because they avoid row misses. This is why I like to allocate the screen starting at the beginning of memory but off-screen surfaces starting at the END and working down. However, if we were to move those two bits into the lower part of the address, we could extend the effective row length to 4096 pixels, giving us a 64x64 patch to work with.

As with Larrabee, I suggest that we provide dedicated texture hardware. We could cache the compressed data and use one decompressor (or a small number of them) to service all texel requests. I expect that random access to graphics memory will have much higher latency than pushing all requests through a small number of decompressors. I'm also leaning toward one or a couple of dedicated rasterizers. Intel's engineers were able to come up with a clever software rasterizer, but our engines are not as powerful and would not be optimized for that task. Efficient hardware rasterizers are fast but can require a lot of area, so we need to figure out just how flexible ours needs to be. Essentially, it would have to linearly walk a triangle, interpolating several arbitrary parameters (later interpreted as things like coordinates, texture coordinates, colors, etc.). But for however many parameters we can interpolate, we'll usually need fewer, while someone is eventually going to want more.
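A rough Python sketch of the address-layout tradeoff, to make the bank-bit argument concrete. The specific field widths here (12 column bits, i.e. a 4 KB row of 1024 4-byte pixels, and 14 row bits) are my assumptions for illustration, not OGD1's actual configuration:

```python
# Layout A (current):  | bank[1:0] | row | column |   -- bank bits at the top
# Layout B (proposed): | row | bank[1:0] | column |   -- bank bits moved low,
# so consecutive row-sized chunks rotate through the four banks, and a 4x
# larger contiguous region can be touched with no row miss in any one bank.

COL_BITS = 12          # assumption: 4 KB row = 1024 pixels * 4 bytes/pixel

def fields_layout_a(addr, row_bits=14):
    col  = addr & ((1 << COL_BITS) - 1)
    row  = (addr >> COL_BITS) & ((1 << row_bits) - 1)
    bank = addr >> (COL_BITS + row_bits)
    return bank, row, col

def fields_layout_b(addr, row_bits=14):
    col  = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & 0b11
    row  = addr >> (COL_BITS + 2)
    return bank, row, col

# In layout B, four consecutive row-sized chunks land in four different
# banks (each with one open row), covering 16 KB = 4096 pixels: a 64x64
# patch instead of 32x32.
banks = {fields_layout_b(i * (1 << COL_BITS))[0] for i in range(4)}
assert banks == {0, 1, 2, 3}
```

Layout A keeps the "screen at the bottom, off-screen surfaces at the top" trick working, since widely separated regions land in different banks; layout B trades that for the larger patch.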
So we need to think about a rasterizer that is optimized specifically for that task but highly flexible, where we can trade off performance against the number of parameters it walks. For instance, it might interpolate in groups of four but be able to walk up to 64 parameters. In that case, it would take 16 cycles to generate the parameters for one fragment, but that's okay, since the shaders are scalar and can't really manipulate all of them at once.

The data flow goes roughly like this:

geometry (shaders) -> vertexes (shaders) -> triangulation -> rasterizer -> fragments (shaders)

Processing things in patches means that we often won't have many vertexes corresponding to a given area. Something has to sort things into patches; that could be the host CPU (which would therefore also do the geometry), or it could be the shaders. (Are there special instructions we need to make geometry processing efficient?) Once it's determined which polygons (broken into triangles) overlap the patch, those vertexes can be rasterized into fragments.

We have a scheduling problem. One thing we might want to do is perform geometry processing for the whole scene in one pass, storing the intermediate results (vertexes) in graphics memory. This isn't my area, but I'm guessing the next step would be a vertex shading pass, also storing its vertexes in graphics memory. This is worthwhile as long as we can keep everything busy. If not, we have to think about getting a head start on fragment processing. If we organize the vertexes properly, then when it's time for fragment processing, one or two shader threads can fetch processed vertexes and feed them to the rasterizer(s), with fragments dynamically load-balanced across all the remaining shaders. We don't have to totally solve the scheduling issue up front. (No one's done a perfect job of it yet anyhow.) But we do need to keep it in mind so that we don't end up hacking changes into the shader engine as an afterthought.
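A small Python sketch of the group-of-four interpolation tradeoff. I'm modeling each parameter as a plane equation p(x, y) = a*x + b*y + c over the triangle, which is a standard way to express linear walking; the hardware would evaluate these incrementally, but the cycle arithmetic is the point here:

```python
from math import ceil

def interpolate_fragment(planes, x, y, group_size=4):
    """Evaluate all parameter planes at one fragment, group_size at a time.
    Returns the parameter values and the cycle count for a unit that can
    interpolate group_size parameters per cycle."""
    values = [a * x + b * y + c for (a, b, c) in planes]
    cycles = ceil(len(planes) / group_size)
    return values, cycles

# 64 parameters, 4 per cycle -> 16 cycles per fragment, as argued above.
planes = [(1.0, 0.0, float(i)) for i in range(64)]
vals, cycles = interpolate_fragment(planes, 2.0, 3.0)
assert cycles == 16
assert vals[0] == 2.0          # 1*2 + 0*3 + 0

# A lighter shader using only 8 parameters pays just 2 cycles per fragment.
_, cycles8 = interpolate_fragment(planes[:8], 2.0, 3.0)
assert cycles8 == 2
```

This is what "trade performance for parameter count" means in practice: the rasterizer's area is fixed by the group size, while the per-fragment cost scales with how many parameters the shader actually declares.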
Ok! Now it's time for everyone to rip apart my arguments. :) Seriously, we need to do a good job at this. With all of this open source brainpower on the problem, that should be a piece of cake, right?

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
