I worked in a team that designed a CPU for a complex analog interface (for a flash memory array). The CPU was an 8-bit, 64-register, 3-stage RISC CPU. It replaced a very complex FSM. The size of the block went from ~10000 gates to fewer than 3000 (including the ROM for the instructions).
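
To illustrate what a 3-stage pipeline without hazard logic does to a branch, here is a small Python sketch. It assumes the stages are fetch plus register read, ALU, and write back, and that a jump is resolved in the ALU stage (an assumption, but one that is consistent with a single delay slot). The instruction names and the tuple encoding are invented for illustration and are not the real instruction set; data hazards are not modelled at all.

```python
# Toy model of a 3-stage pipeline (fetch + register read, ALU, write back)
# with no hazard logic. Only control flow is modelled, not data hazards.
# Assumption: a jump is resolved in the ALU stage, one cycle after its fetch.
# The instruction names and tuple encoding are hypothetical.

def fetch_order(program, max_cycles=100):
    """Return the addresses of the instructions in the order they are fetched."""
    pc = 0
    in_alu = None          # instruction fetched during the previous cycle
    order = []
    for _ in range(max_cycles):
        if pc >= len(program):
            break          # ran off the end of the program
        fetched = program[pc]
        order.append(pc)
        # The ALU stage sees the previous instruction while this one is fetched,
        # so the instruction right after a jump is always fetched: the delay slot.
        if in_alu is not None and in_alu[0] == "jmp":
            pc = in_alu[1]
        else:
            pc += 1
        in_alu = fetched
    return order


PROGRAM = [
    ("add",),      # 0
    ("jmp", 4),    # 1: jump to address 4
    ("sub",),      # 2: delay slot, still fetched and executed
    ("xor",),      # 3: skipped
    ("or",),       # 4: jump target
]

print(fetch_order(PROGRAM))  # [0, 1, 2, 4]
```

In this model the instruction at address 2 always runs, so whoever writes the code has to put either useful work or a no-op there.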
The CPU was a very, very simple one: 1st stage: instruction fetch + register port read; 2nd stage: ALU; 3rd stage: write back. There was no logic for pipeline hazards, so we had a "natural" delay slot for branches. Half of the register bank was true memory locations; the other half was the inputs and outputs of the block. There was no load/store unit, only the register bank. There was a shadow register for calls (a kind of stack for the return address, but only one call deep). The result was small and very, very fast.

Nicolas Boulay

>> Yes, but we had this clever idea of unifying the two. This way,
>> there's no need for a special instruction for full-size immediate
>> constants, for instance. (But just because it's cute doesn't mean
>> it's a good idea.)
>
> Just use a second blockram for constant storage. The blockram gets
> loaded when the FPGA is loaded, and any of it that you're not using
> for constants can be used for variables.
>
> Perhaps the constants could be shared with the instruction blockram,
> provided that you never use two constants as both operands of a
> single instruction. This is easiest if you restrict the constants
> to being source operand B.
>
>> Any given instruction can do two reads at the same time, followed by a
>> write. Include instruction fetch. Overlap that with four other
>> instructions also in the pipeline, and that's a lot of memory activity.
>
> That's not how pipelining works. Not counting the instruction fetch
> (because that should come from a separate RAM), there are only three
> data RAM accesses per cycle, two reads and a write. When an instruction
> is in the pipeline, it only gets its two data fetches while it is in
> the fetch stage, and it only gets its write when it is in the writeback
> stage. So even though there are multiple instructions in flight, only
> one is fetching register operands and only one is writing results
> back on any given clock cycle. That's why you only need a three-port
> register file.
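
The port-count argument just above can be checked with a few lines of Python. This is only a sketch under stated assumptions: one instruction enters the pipeline per cycle, every instruction reads two source registers in its first stage and writes one result in its last stage, and there are no stalls. The function name and parameters are made up for the example.

```python
from collections import Counter

def port_usage(n_instructions, depth):
    """Return a (reads, writes) pair for each clock cycle.

    Assumptions: one instruction issued per cycle, two operand reads in
    the first stage, one result write in the last stage, no stalls.
    """
    reads = Counter()
    writes = Counter()
    for issue_cycle in range(n_instructions):
        reads[issue_cycle] += 2                 # operands read once, on entry
        writes[issue_cycle + depth - 1] += 1    # result written once, on exit
    total_cycles = n_instructions + depth - 1
    return [(reads[c], writes[c]) for c in range(total_cycles)]


for depth in (3, 4, 5):
    usage = port_usage(10, depth)
    peak_reads = max(r for r, _ in usage)
    peak_writes = max(w for _, w in usage)
    print(depth, peak_reads, peak_writes)   # peak is 2 reads, 1 write each time
```

Whatever the pipeline depth, the peak per cycle stays at two reads and one write, which is the three-port figure quoted above.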
>
> Another way to look at it is that if you do an "add r1, r2, r3"
> instruction, it may be in the pipeline for three or four clocks, but
> it only fetches its source operands once, and it only writes its
> results back once.
>
>> As you start adding ports, you might as well just use random logic,
>> which is one bit per CLB.
>
> No, because with a three-port (2-read, 1-write) register file, you
> get 64 bits per CLB. If you had to expand that to five ports
> (4-read, 1-write), you still get 32 bits per CLB.
>
>> Doesn't sound very economical. Also, you're doing this for
>> performance... do we really need the performance?
>
> It's both better performance AND more compact, because it's using
> the FPGA resources the way the designers intended, rather than trying
> to shoehorn in something else. And yes, a VGA text display that only
> can update at 10 Hz needs more performance.
>
> Eric
>
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
