I've been thinking about what it would take to implement a vector ISA.
I think we've settled on the ALUs being scalar, but we could, in
theory, get a 45% reduction in code size by including vector
instructions.

But not so fast.  These vectors are apparently not so conveniently
behaved.  It's not like we're dealing with argb and xyzw all the time
here.  With possibilities like xxyy, we find that we need some way to
indicate how vectors should be swizzled.  Also, there are four sizes
of vector (if you include scalars).  We don't want to be issuing four
microops for a 2-vector.  And we end up with a register allocation
problem, trying to pack things together into aligned groups or else
specify up to 8 source and 4 target registers in one instruction.  All
of these things take more bits, whether that means longer instruction
words or more instructions to get things into the right places.
Either way, that ideal 45% compression is unachievable.
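
To make the swizzle cost concrete, here's a back-of-the-envelope
sketch (the lane and bit counts follow from 4-component vectors; the
helper functions are just mine, for illustration):

```python
# Each lane of a 4-vector source selects one of four components
# (x, y, z, w), so a fully general swizzle like "xxyy" needs
# 2 bits per lane.
LANES = 4
BITS_PER_SELECT = 2                          # log2(4 components)
bits_per_source = LANES * BITS_PER_SELECT    # 8 bits per source operand

# With two source operands per instruction, that's 16 extra bits
# before we even talk about register numbers or vector length.
two_source_overhead = 2 * bits_per_source

def decode_swizzle(name):
    """Turn a swizzle string like 'xxyy' into lane-select indices."""
    return ["xyzw".index(c) for c in name]

def apply_swizzle(vec, name):
    """Rearrange a 4-vector according to a swizzle string."""
    return [vec[i] for i in decode_swizzle(name)]
```

Sixteen bits of swizzle selectors alone eats a big chunk of any
instruction word, which is exactly why the 45% doesn't survive contact
with reality.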

Even if we're damn clever and find some way to get a 25% reduction in
code size (sounds optimistic to me), it does complicate the hardware,
and it also complicates our lives as designers.  Plus, we have
declared that we are not afraid to completely rip up and redo the
whole low-level ISA.  Using an intermediate representation like LLVM,
something that supports vector instructions, gives us the ability to
adapt to hardware changes quickly, as long as we update the drivers
along with the hardware.
And in this world of Free Software, we are not afraid to do that
either, although stipulating an intermediate language does make it
plenty convenient for the proprietary folks too, IMHO.

If we're ever going to get this off the ground, we have to make
compromises.  Every little addition needs to be justified.  We should
even question the obvious.  You don't NEED an OR instruction.  You can
get by with AND and XOR.  I'm not saying that we should be that
extreme, but we should be thinking about how we can trade off hardware
complexity for software complexity.  Turning 30 instructions into one
(e.g. divide) is a win.  Turning two instructions into one isn't so
interesting.
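
The "you don't NEED an OR" point, concretely: OR is expressible with
one AND and two XORs, via the identity a|b == (a&b) ^ a ^ b.

```python
def or_from_and_xor(a, b):
    """Compute bitwise OR using only AND and XOR:
    (a & b) ^ a ^ b flips exactly the bits set in either input."""
    return (a & b) ^ a ^ b
```

Three instructions instead of one: legal, but exactly the kind of
trade that's probably not worth making, which is the point.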

If you want to critique something, look at the instruction set that
Andre and Kenneth have defined and see if you can find a way to get
the same work done with fewer opcodes.  Fewer opcodes means
potentially more bits in the instruction word for things like wider
register-number fields.  That's just one example of a worthwhile
optimization.

One issue that deserves some attention, because it has such a huge
penalty, is memory/texture reads.  If we know in advance that we need
a certain number of consecutive memory words, requesting them all at
once would be a major boost in performance.  So I have a proposal:

When the shader is waiting on reads, it's basically stalled.  The
shader is doing nothing in that thread's timeslot, which means that
its write-back stage is available.  Every instruction has 2 source
operands and one target.  We can use one source as an address and two
bits (or maybe a few more) of the target to specify a length.  The
target specifies the first register that will be filled.  Subsequent
words will be loaded into subsequent registers.  I suggest that we
make room for this in the ISA and consider implementing it in a later
point revision of hardware.
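
A minimal behavioral model of the proposal, just to pin down the
semantics (register-file size and the exact width of the length field
are placeholders, not decisions):

```python
def block_load(regs, mem, addr_reg, target, length_bits):
    """Proposed block load: one source register holds the address;
    two bits stolen from the target field encode a length of 1..4.
    Consecutive memory words fill consecutive registers starting
    at 'target'."""
    count = length_bits + 1          # 2-bit field encodes 1..4 words
    base = regs[addr_reg]
    for i in range(count):
        regs[target + i] = mem[base + i]
    return regs
```

Since the shader is stalled on the read anyway, the otherwise-idle
write-back slots in that thread's timeslices are what absorb the extra
register writes.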

Beyond that, I think we should drop the idea of a vector ISA for the
near term.  If we decide later to add one, it'll be a complete redo of
the whole ISA.  Vectors, of course, will be supported, but unrolled by
the compiler.

I also really like the idea of using the ALU to do divides.  They're
relatively rare instructions.  We could even consider not
implementing division at all.  But without predication, it could take
a good 200 cycles to
do a 32-bit integer divide.  (I'm sure someone can do better, but this
is for illustration.)  It would also require that every kernel that
does a divide include a subroutine to do it.  On the other hand, if we
were to generalize the ALU a bit and add a state machine to decode, we
could squeeze it into 32 cycles.  To me, it's not obvious which is the
correct answer.  So once again, I propose that we make room for it in
the ISA but not implement it until (possibly) a later point release.
I leave it as a challenge to the reader to come up with a compromise;
figure out minor ALU changes and add a particular form of predication,
and you can squeeze the divide into a fixed sequence of 32
instructions (or fewer if the divisor is a constant).  Figure out how
that predication could easily save us cycles under other circumstances
(remember that we have no need for branch prediction, so branches are
cheap), and you're the bee's knees.
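
For reference, here's the fixed 32-step shift-subtract (restoring)
division I have in mind -- the loop body is the thing a decode-stage
state machine would sequence, or that predication would let us unroll
without branches:

```python
def divide_32(n, d):
    """Unsigned 32-bit restoring division: exactly 32 iterations of
    shift / compare / conditional-subtract, one per quotient bit."""
    assert d != 0
    q, r = 0, 0
    for i in range(31, -1, -1):
        r = (r << 1) | ((n >> i) & 1)   # bring down next dividend bit
        if r >= d:                       # the predicated subtract
            r -= d
            q |= 1 << i
    return q, r
```

Each iteration is a couple of ALU ops, so with predication (and no
branch-prediction penalty to worry about) the whole thing flattens
into a fixed straight-line sequence.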

There are several things of more immediate practical consideration
(not an exhaustive list):

- Define the functional units and decide exactly how the engine is
pipelined.  I think we should have a fixed-length pipeline with the
FUs arranged in parallel (rather than serially).  The first two stages
would be fetch (icache, branch) and decode (register lookup,
writeback).  The next 6 stages should be all the ALUs.  We can easily
pipeline an FP mul or add into 6 stages.  For other units, we can
pipeline a bit, but we don't really need to; there's just slack.  At
the end, we need a 4-to-1 MUX to select between FP add, FP mul, all
other ALUs (they would be muxed earlier), and memory reads.  (Memory
writes just "disappear" into a queue and have no register write-back,
although a filled queue would stall issue of store instructions.)

- Actually define the bit representation of the ISA.  Technically, we
have 36 bits per instruction to play with.  We can (a) avoid using
those bits, (b) use them directly, or (c) "predecode" some
instructions on the way into the icache so as to make the decode stage
in the pipeline simpler.  I think we should do (a) first and then see
if there's anything obvious we can do later to achieve (c), maybe in a
point release.
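
A toy 36-bit encoding, just to make option (b) concrete.  The field
widths here are invented for illustration (8-bit opcode, three 8-bit
register fields, 4 bits of flags), not a proposal:

```python
def encode(op, dst, src1, src2, flags=0):
    """Pack fields into a 36-bit instruction word:
    [35:28]=op [27:20]=dst [19:12]=src1 [11:4]=src2 [3:0]=flags."""
    assert op < 256 and dst < 256 and src1 < 256 and src2 < 256 and flags < 16
    return (op << 28) | (dst << 20) | (src1 << 12) | (src2 << 4) | flags

def decode(word):
    """Unpack a 36-bit word back into its fields."""
    return ((word >> 28) & 0xFF, (word >> 20) & 0xFF,
            (word >> 12) & 0xFF, (word >> 4) & 0xFF, word & 0xF)
```

Note that 8-bit register fields are far more than we'd actually
spend; the slack is what options like (c) would predecode into.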

- Design the icache (associativity, size, etc.)

- Design the thread management logic.  We start out with, for example,
8 sets of four.  When a condition for one thread in a set deviates
from the others, we split it off into another round.
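
A minimal model of that split-on-divergence rule (the data structures
are mine; the hardware would obviously track this with masks, not
lists):

```python
def split_on_branch(thread_set, taken):
    """Given a set of lockstep thread ids and a dict mapping each id
    to its branch outcome, split the set into taken / not-taken
    subsets, dropping whichever half is empty."""
    a = [t for t in thread_set if taken[t]]
    b = [t for t in thread_set if not taken[t]]
    return [s for s in (a, b) if s]
```

If all four threads agree, nothing splits and we keep full occupancy;
only genuine divergence costs us a scheduling slot.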

- Sketch the rest of the system, like the global dcache.  We can do
texture compression later, so for now, texture fetches are just
memory reads.  We
also have one queue from each macroshader to request writes or request
reads, and we have a single global return queue for read data (a large
one, possibly spilling into one really small one per shader).

I'll stop there.  There's tons more, but it's time to look at the
charts and diagrams we have and start drawing some more.

-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
