On 2008-08-15, Timothy Normand Miller wrote:
> I looked.  It's worse than that.  It assumes that you're going to read
> at least 8 words starting on an 8-word boundary.  I designed it under
> the assumption that memory reads would always come in 16-word blocks.
> But I THINK that the real enforced granularity is 8.  Also, the count
> is 7 bits, and the address auto-inc only considers the lower 7 bits.
> (Actually, since it's doing even numbers, only bits [6:1] are
> inc'd/dec'd.)  So I think you could safely read up to 15 blocks of 8
> (120 words), as long as you don't cross a 128-word boundary.

This is going to consume a fair amount of program memory.  IIRC, we may
terminate the read at any point and let the host re-issue a read for the
missing part?  I'm thinking that a practical simplification is to
terminate the read at the next 128-word boundary.  Moreover, do we gain
anything by reading more than will fit in the cache, which as far as I
can see is 32 words?  If not, maybe we should cut on a 32-word boundary.
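
To make the boundary idea concrete, here is a rough Python model (not HQ
code) of cutting a read at the next boundary and letting the host re-issue
the rest.  The function names are made up for illustration, and it assumes
word addressing and a power-of-two boundary (128, or 32 if we cut on the
cache size).

```python
def words_until_boundary(addr, boundary=128):
    """Words that can be served from addr before hitting the next boundary."""
    return boundary - (addr % boundary)


def clamp_read(addr, count, boundary=128):
    """Words the target would actually serve for one read request."""
    return min(count, words_until_boundary(addr, boundary))


def split_read(addr, count, boundary=128):
    """Model the host re-issuing reads until all words are fetched.

    Returns a list of (start_address, word_count) pairs, one per request.
    """
    chunks = []
    while count > 0:
        n = clamp_read(addr, count, boundary)
        chunks.append((addr, n))
        addr += n
        count -= n
    return chunks
```

For example, split_read(120, 16) models a 16-word read starting 8 words
below a 128-word boundary: the target serves 8 words, and the host
re-issues a read for the remaining 8.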

> > Another goofy thing:  It seems tricky at best to unroll the
> > transfer-loop for target write.  The reason is that we only know the
> > number of queued commands, but what we need is the number of queued
> > write-data commands.  Any idea?
> 
> That is tricky, and we may have no good answer for that.
> 
> I suggest we do nothing about it right now.  We should get a working
> revision out, then we can go back later and see if we can do anything
> clever with the CPU design.

Okay, we don't unroll, but we can still avoid sending an address for
each write request.
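
A toy illustration of the saving, in Python rather than HQ assembly.  The
"ADDR"/"DATA" command tuples are hypothetical and are not the real command
encoding; the sketch also assumes the receiving side auto-increments the
address after each data word.

```python
def encode_writes_naive(start_addr, words):
    """One address command per data word (what we want to avoid)."""
    cmds = []
    for i, w in enumerate(words):
        cmds.append(("ADDR", start_addr + i))
        cmds.append(("DATA", w))
    return cmds


def encode_writes(start_addr, words):
    """One address command, then data only; relies on address auto-increment."""
    cmds = [("ADDR", start_addr)]
    cmds.extend(("DATA", w) for w in words)
    return cmds
```

For N consecutive words this is N + 1 commands instead of 2N.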
 
> One thing I've thought of is going to wide instruction word.  Two
> instructions are side-by-side, and you have two register files and two
> ALUs.  On a fetch, you can fetch any two regs from file A and any two
> from file B and then cross them over in any way you like to the ALUs,
> then on writeback, you can do one write to each file.  It's like two
> processors in parallel but with the ability to cross registers over
> between them with restrictions.  Now we can deal with some of the
> inefficiencies, if we can schedule instructions properly.  Obviously
> only one instruction could hold a branch (but it could be either, and
> we could allow it to be both if only one could compute to true).

That would probably help.  We could probably just widen the data path
inside HQ.  We could probably drop the adder from the second ALU, or
replace it with a 16-bit decrementer if needed.  Comparisons for
equality can be done with xor instead of rsub.  However, this implies a
fair amount of rewriting in both the RTL and the assembler, and makes
the HQ more difficult to program with 3 delay slots and register
dependencies, so we should be sure we really need it and that it will
solve the problem.
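
On the xor point: a == b exactly when a xor b is zero, so an ALU without
an adder can still test equality.  A quick Python check of that
equivalence, assuming 32-bit words (the width is my assumption here) and
taking rsub to mean reverse subtract (b - a):

```python
MASK = 0xFFFFFFFF  # assumed 32-bit word width


def eq_via_rsub(a, b):
    """Equality test via reverse subtract: zero result means equal."""
    return ((b - a) & MASK) == 0


def eq_via_xor(a, b):
    """Equality test via xor: no carry chain (adder) needed."""
    return ((a ^ b) & MASK) == 0
```

Both agree on any pair of words; xor just cannot tell us less-than or
greater-than, which is fine if the second ALU only needs equality.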
