On 9/3/07, Petter Urkedal <[EMAIL PROTECTED]> wrote:
>
> I had in mind writing the result first to res_r, i.e. in the above case
> statement, thus my insistence that it have no effect on timing. But if
> we can MUX it on the res_o output as you indicate, that's even better as
> it means we can use the result one cycle earlier. It's worth a try.
I like your idea better. The only problem is that this could collide
with some other write to a register. Restricting what instructions
can go into the slot at this time kinda defeats the purpose.
> > That's what I was thinking, but there's also an argument to be made
> > for using some dedicated instructions. When doing only through I/O,
> > we need to execute three instructions (and use up the corresponding
> > three additional cycles), two to set up the product, and one to fetch
> > the result. When using a dedicated instruction and a register
> > override, we need one extra instruction (one to set up the multiply,
> > zero to fetch it, because it can be an operand directly in another
> > instruction).
>
> If I can put the current line of argument in parentheses for a moment,
> I'd like to consider a more systematic exploitation of register-like
> entities in the spirit of the IO-architecture (or whatever it's called).
> The main deficit of the latter is the number of moves (3) required for
> each operation. This is related to the redundancy of doing address
> calculation and supporting a big address space as if IO-ports were
> memory. What if we replace IO ports with IO "registers", quite
> literally? Let's call them s0, s1, ..., s31 (special registers). Now,
> extend the register number bits in the instruction word from 5 to 6, so
> that we can freely select operands and write-back among {r0, ..., r31,
> s0, ..., s31}. This allows us to store computed values directly into
> IO-ports, not just the multiplication unit, but also DMA and other
> registers.
Well, that's not a bad idea. It's also worth pondering architectures
that have 512 local registers, unifying the scratch space with the
register file. But that may be too radical.
If we extend the register numbers to 6 bits, what does that do to
other bits in the instruction?
> There are several ways to go from here. For instance, say s0 and s1
> connect us to the multiplier. Then a multiply is
>
> move r0, s0 ; first operand
> move r1, s1 ; second operand, triggers multiplication
> ; wait 16-17 cycles
> move s0, r0 ; lower 32 bits of product
> move s1, r1 ; upper 32 bits if we need them
>
> but, it must be noted that each move instruction can be replaced with a
> computation, so in the best case scenario, we don't use any instructions
> on the multiplication itself. However, that may be too optimistic, and
> to make it easier to be efficient we may allow both operands to be
> loaded simultaneously:
Here's another idea: Have two register files that work independently.
Usually, it doesn't make much difference. But on occasion, we may
have a situation where two instructions want to dump results into
registers at the same time. As long as they're allocated properly,
there will be no conflict. In fact, we can do this in groups of 16
registers.
So, rather than using special I/O registers, we just carefully select
target registers for the multiply and for the instruction in the slot
corresponding to when the multiply finishes so that they don't
conflict.
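
A minimal sketch of the allocation rule this implies, assuming each group of 16 registers has its own single write port (the port count is my assumption, and the function names are made up for illustration):

```python
BANK_SIZE = 16  # "groups of 16 registers", as suggested above

def bank(reg):
    """Index of the bank (group of 16) that holds register number reg."""
    return reg // BANK_SIZE

def writes_conflict(dst_a, dst_b):
    """True if two same-cycle write-backs target the same bank.

    Assumes one write port per bank, so writes to different banks
    can complete in the same cycle.
    """
    return bank(dst_a) == bank(dst_b)
```

For example, a multiply targeting r17 and a slot instruction targeting r3 land in different banks and can both write back, while r3 and r12 would collide.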
>
> move (r0, r1), (s0, s1) ; we'll use a nicer syntax
> ; wait 16-17 cycles
> move s0, r0 ; lower 32 bits of product
> move s1, r1 ; upper 32 bits of product
>
> Our register file allows two simultaneous reads, so the double source in
> the first instruction can be implemented. Further, the IO-ports are
> independently controlled, and we can use separate buses for even and odd
> port numbers. On the other hand, when writing back the result, we
> respect that our register file only allows one write at a time. This is
> probably not a big loss, since most of the time, we only need the low 32
> bits of the result.
If we're going to have to use separate move instructions to fetch the
result, what is the advantage over fetching them from I/O space?
Forwarding?
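
For reference, the special-register sequence quoted above can be modelled roughly as follows. This is only a toy sketch: the class and method names are hypothetical, the multiply is assumed to be unsigned, and the latency is fixed at 17 cycles where the quoted text says "16-17".

```python
MASK32 = 0xFFFFFFFF
LATENCY = 17  # assumed; the quoted text only says "16-17 cycles"

class MulPort:
    """Toy model of a multiplier sitting behind special registers s0 and s1."""

    def __init__(self):
        self.a = 0        # first operand, latched by a write to s0
        self.product = 0  # full 64-bit product
        self.busy = 0     # cycles remaining until the product is valid

    def write(self, sreg, value):
        if sreg == 0:
            self.a = value & MASK32
        elif sreg == 1:
            # Writing the second operand triggers the multiplication.
            self.product = self.a * (value & MASK32)
            self.busy = LATENCY
        else:
            raise ValueError("only s0 and s1 are modelled")

    def tick(self, cycles=1):
        """Advance time by the given number of cycles."""
        self.busy = max(0, self.busy - cycles)

    def read(self, sreg):
        if sreg not in (0, 1):
            raise ValueError("only s0 and s1 are modelled")
        if self.busy:
            raise RuntimeError("product not ready yet")
        # s0 returns the low 32 bits, s1 the high 32 bits.
        return (self.product >> (32 * sreg)) & MASK32

mul = MulPort()
mul.write(0, 0xFFFFFFFF)   # move r0, s0
mul.write(1, 2)            # move r1, s1 (starts the multiply)
mul.tick(LATENCY)          # wait for the result
low, high = mul.read(0), mul.read(1)
```

The model says nothing about forwarding or bus timing; it only pins down the write-write-wait-read protocol being discussed.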
> > Yeah. Sort of an "interrupt mode" state bit that is used as the
> > higher-order bit of the register index.
>
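
As a sketch of that state-bit idea, assuming 5-bit architectural register numbers and a register file doubled to 64 entries (the function name is made up for illustration):

```python
REG_BITS = 5  # 32 architectural registers, matching the 5-bit field

def physical_index(reg, interrupt_mode):
    """Map an architectural register number to a slot in a doubled file.

    The mode bit becomes the high-order bit of the index, so normal
    code and interrupt code each see their own 32 registers.
    """
    if not 0 <= reg < (1 << REG_BITS):
        raise ValueError("register number out of range")
    return (int(bool(interrupt_mode)) << REG_BITS) | reg
```

With this mapping r3 is slot 3 in normal mode and slot 35 in interrupt mode, so an interrupt handler needs no save/restore of the normal-mode registers.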
> I thought about the possibility of dual-threading the nanocontroller. It
> would have some advantages like avoiding the delay-slot, and allowing
> us to pipeline any arithmetic operation over two stages without effect
> on the code. However, the extra pipelining opportunities would probably
> account for less than a factor 2 in speed, so we'd need to utilise the
> threads, meaning we can't afford to let one just wait for interrupts.
Yeah, I thought about that too. There was a time when I was pondering
a DMA controller design that would require more babysitting. I saw no
way around the hyperthreading, you might say. But with a little more
smarts, the DMA controller wouldn't require that, and then we could
get full throughput from the nanocontroller.
--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project