I worked in a team that designed a CPU for a complex analog interface
(for a flash memory array). The CPU was an 8-bit, 64-register, 3-stage
RISC CPU. It replaced a very complex FSM. The size of the block went
from ~10000 gates to fewer than 3000 (including the ROM for the
instructions).

The CPU was a very simple one:
1st stage : fetch instruction + read register port,
2nd stage : ALU,
3rd stage : write back.

There was no logic for pipeline hazards, so we had a "natural" delay slot
for branches. Half of the register bank was true memory locations; the
other half was the outputs or inputs of the block. There was no load/store
unit, only memory. There was a shadow register for calls (a kind of stack
for the return address, but only one call deep).
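
A minimal sketch, in Python, of how the delay slot falls out of a
pipeline like this. The instruction names (li, add, jmp), the tuple
encoding, the write-before-read ordering and the choice to resolve the
jump in the ALU stage are all assumptions for illustration, not the
real design:

```python
def run(program, max_cycles=100):
    """Toy 3-stage pipeline with no hazard logic (hypothetical encoding)."""
    regs = [0] * 64      # 64 eight-bit registers
    pc = 0
    alu_in = None        # latch between stage 1 and stage 2
    wb_in = None         # latch between stage 2 and stage 3
    fetched = []         # addresses fetched, in order

    for _ in range(max_cycles):
        # Stop once nothing is left to fetch and the pipeline is empty.
        if pc >= len(program) and alu_in is None and wb_in is None:
            break

        # Stage 3: write back the result produced in the previous cycle.
        if wb_in is not None:
            rd, value = wb_in
            regs[rd] = value & 0xFF

        # Stage 2: ALU. A jump only changes the PC here (assumption),
        # one cycle after it was fetched.
        wb_in = None
        next_pc = pc + 1
        if alu_in is not None:
            op, rd, a, b = alu_in
            if op == "add":
                wb_in = (rd, a + b)
            elif op == "li":
                wb_in = (rd, a)
            elif op == "jmp":
                next_pc = a

        # Stage 1: fetch + register read. Nothing squashes this fetch
        # when a jump is in stage 2, so this is the delay slot.
        alu_in = None
        if pc < len(program):
            fetched.append(pc)
            ins = program[pc]
            if ins[0] == "add":
                alu_in = ("add", ins[1], regs[ins[2]], regs[ins[3]])
            elif ins[0] == "li":
                alu_in = ("li", ins[1], ins[2], 0)
            elif ins[0] == "jmp":
                alu_in = ("jmp", 0, ins[1], 0)

        pc = next_pc
    return regs, fetched


program = [
    ("li", 1, 5),   # 0
    ("jmp", 4),     # 1
    ("li", 2, 7),   # 2: delay slot, still executed
    ("li", 3, 9),   # 3: skipped
    ("li", 4, 1),   # 4: jump target
]
regs, fetched = run(program)
print(fetched)      # [0, 1, 2, 4]
```

The instruction at address 2 runs even though the jump comes before
it, and address 3 is never fetched. The same toy model has no
forwarding either, so an instruction that reads a register written by
the one just before it would see the old value.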

The result was small and very fast.

Nicolas Boulay

>> Yes, but we had this clever idea of unifying the two.  This way,
>> there's no need for a special instruction for full-size immediate
>> constants, for instance.  (But just because it's cute doesn't mean
>> it's a good idea.)
>
> Just use a second blockram for constant storage.  The blockram gets
> loaded when the FPGA is loaded, and any of it that you're not using
> for constants can be used for variables.
>
> Perhaps the constants could be shared with the instruction blockram,
> provided that you never use two constants as both operands of a
> single instruction.  This is easiest if you restrict the constants
> to being source operand B.
>
>> Any given instruction can do two reads at the same time, followed by a
>> write.  Include instruction fetch.  Overlap that with four other
>> instructions also in the pipeline, and that's a lot of memory activity.
>
> That's not how pipelining works.  Not counting the instruction fetch
> (because that should come from a separate RAM), there are only three
> data RAM accesses per cycle, two reads and a write.  When an instruction
> is in the pipeline, it only gets its two data fetches while it is in
> the fetch stage, and it only gets its write when it is in the writeback
> stage.  So even though there are multiple instructions in flight, only
> one is fetching register operands and only one is writing results
> back on any given clock cycle.  That's why you only need a three-port
> register file.
>
> Another way to look at it is that if you do an "add r1, r2, r3"
> instruction, it may be in the pipeline for three or four clocks, but
> it only fetches its source operands once, and it only writes its
> results back once.
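
The port count described above can be checked with a small Python
sketch. The four stage names and the back-to-back stream of 3-operand
instructions are assumptions for illustration; the point is only that
each instruction touches the register file in two stages, whatever the
pipeline depth:

```python
# Hypothetical stage names; only "regread" and "writeback" touch the
# register file.
STAGES = ["ifetch", "regread", "execute", "writeback"]


def port_usage(n_instructions):
    """Per-cycle (reads, writes) on the register file for a stream of
    back-to-back 'add r1, r2, r3' style instructions."""
    usage = []
    total_cycles = n_instructions + len(STAGES) - 1
    for cycle in range(total_cycles):
        reads = writes = 0
        for i in range(n_instructions):
            stage_index = cycle - i   # instruction i enters on cycle i
            if 0 <= stage_index < len(STAGES):
                stage = STAGES[stage_index]
                if stage == "regread":
                    reads += 2        # two source operands, fetched once
                elif stage == "writeback":
                    writes += 1       # one result, written once
        usage.append((reads, writes))
    return usage


usage = port_usage(6)
print(max(r for r, _ in usage), max(w for _, w in usage))  # 2 1
```

However many instructions are in flight, no cycle needs more than two
reads and one write.
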
>
>> As you start adding ports, you might as well just use random logic,
>> which is one bit per CLB.
>
> No, because with a three-port (2-read, 1-write) register file, you
> get 64 bits per CLB.  If you had to expand that to five ports
> (4-read, 1-write), you still get 32 bits per CLB.
>
>> Doesn't sound very economical.  Also, you're doing this for
>> performance... do we really need the performance?
>
> It's both better performance AND more compact, because it's using
> the FPGA resources the way the designers intended, rather than trying
> to shoehorn in something else.  And yes, a VGA text display that only
> can update at 10 Hz needs more performance.
>
> Eric
>
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
>

