I've been thinking about what it would take to implement a vector ISA. I think we've settled the issue of the ALUs being scalar, but in theory we could get a 45% reduction in code size by including vector instructions.
But not so fast. These vectors are not so conveniently behaved. We're not dealing with argb and xyzw all the time here; with possibilities like xxyy, we need some way to indicate how vectors should be swizzled. There are also four sizes of vector (if you include scalars), and we don't want to be issuing four microops for a 2-vector. And we end up with a register allocation problem: either pack things together into aligned groups, or else specify up to 8 source and 4 target registers in one instruction. All of these things take more bits, whether that means longer instruction words or more instructions to get things into the right places. Either way, that ideal 45% compression is unachievable. Even if we're damn clever and find some way to get a 25% reduction in code size (which sounds optimistic to me), it complicates the hardware, and it complicates our lives as designers.

Besides, we have declared that we are not afraid to completely rip up and redo the whole low-level ISA. Using an intermediate representation like LLVM, something that supports vector instructions, gives us the ability to adapt to hardware changes quickly, as long as we update the drivers along with the hardware. In this world of Free Software, we are not afraid to do that either, although stipulating an intermediate language does make it plenty convenient for the proprietary folks too, IMHO.

If we're ever going to get this off the ground, we have to make compromises. Every little addition needs to be justified. We should even question the obvious. You don't NEED an OR instruction; you can get by with AND and XOR. I'm not saying we should be that extreme, but we should be thinking about how we can trade off hardware complexity for software complexity. Turning 30 instructions into one (e.g. divide) is a win. Turning two instructions into one isn't so interesting.
If you want to critique something, look at the instruction set that Andre and Kenneth have defined and see if you can find a way to get the same work done with fewer opcodes. Fewer opcodes means potentially more bits in the instruction word for things like wider register-number fields. This is just one example of a worthwhile optimization.

One issue that deserves some attention, because it carries such a huge penalty, is memory/texture reads. If we know in advance that we need a certain number of consecutive memory words, requesting them all at once would be a major boost in performance. So I have a proposal: when the shader is waiting on reads, it's basically stalled. The shader is doing nothing in that thread's timeslot, which means its write-back stage is available. Every instruction has 2 source operands and one target. We can use one source as an address and two bits (or maybe a few more) of the target to specify a length. The target specifies the first register to be filled; subsequent words are loaded into subsequent registers. I suggest we make room for this in the ISA and consider implementing it in a later point revision of the hardware.

Beyond that, I think we should drop the idea of a vector ISA for the near term. If we decide later to add one, it'll be a complete redo of the whole ISA. Vectors, of course, will be supported, but unrolled by the compiler.

I also really like the idea of using the ALU to do divides. They're relatively rare instructions. We could consider not implementing divide at all, but without predication it could take a good 200 cycles to do a 32-bit integer divide. (I'm sure someone can do better, but this is for illustration.) It would also require that every kernel that does a divide include a subroutine for it. On the other hand, if we were to generalize the ALU a bit and add a state machine to decode, we could squeeze it into 32 cycles. To me, it's not obvious which is the correct answer.
So once again, I propose that we make room for it in the ISA but not implement it until (possibly) a later point release. I leave it as a challenge to the reader to come up with a compromise: figure out minor ALU changes, add a particular form of predication, and you can squeeze the divide into a fixed sequence of 32 instructions (or fewer if the divisor is a constant). Figure out how that predication could save us cycles under other circumstances (remember that we have no need for branch prediction, so branches are cheap), and you're the bee's knees.

There are several things of more immediate practical consideration (not an exhaustive list):

- Define the functional units and decide exactly how the engine is pipelined. I think we should have a fixed-length pipeline with the FUs arranged in parallel (rather than serially). The first two stages would be fetch (icache, branch) and decode (register lookup, writeback). The next 6 stages would be all the ALUs; we can easily pipeline an FP mul or add into 6 stages. The other units can be pipelined a bit, but we don't really need to; there's just slack. At the end, we need a 4-to-1 MUX to select between FP add, FP mul, all other ALUs (which would be muxed earlier), and memory reads. (Memory writes just "disappear" into a queue and have no register write-back, although a full queue would stall issue of store instructions.)

- Actually define the bit representation of the ISA. Technically, we have 36 bits per instruction to play with. We can (a) avoid those bits, (b) use them directly, or (c) "predecode" some instructions on the way into the icache so as to make the decode stage of the pipeline simpler. I think we should do (a) first and then see if there's anything obvious we can do later to achieve (c), maybe in a point release.

- Design the icache (associativity, size, etc).

- Design the thread management logic. We start out with, for example, 8 sets of four.
When a condition for one thread in a set deviates from the others, we split it off into another round.

- Sketch the rest of the system, like the global dcache. We can do texture compression later, so for now, textures are just memory reads. We also have one queue from each macroshader to request writes or reads, and a single global return queue for read data (a large one, possibly spilling into one really small one per shader).

I'll stop there. There's tons more, but it's time to look at the charts and diagrams we have and start drawing some more.

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
