I probably mentioned this already, but your emailer isn't wrapping lines properly.
On 8/14/07, Mark <[EMAIL PROTECTED]> wrote:
> Timothy Normand Miller wrote:
> > On 8/14/07, Farhan Mohamed Ali <[EMAIL PROTECTED]> wrote:
> >> Speaking of multipliers, i was wondering what speed we are targeting for
> >> the floatmult25 or the entire FPGA in general?

We're hoping to get the GPU to exceed 100MHz in the FPGA. Maybe we
can't do that. We'll see. I'm going to pipeline the heck out of it.
For instance, I have gotten 8x8 multipliers to exceed 140MHz when I
make sure to have a few extra registers packed around the multiplier
for routing from my logic all the way over to it and all the way back
again.

> I have managed to get a 3 stage version working at <9ns according to the
> tools (targeted for 3S1500 but i'm not sure how accurate the auto generated
> timing constraints

I always specify my constraints explicitly.

> are. Seems a bit quirky to me, it does not respond as i expect to changes i
> make (as in, why do changes i make in stage1 affect the critical path which
> is in stage2, and weird stuff like that). I'm used to working on full custom
> ASICs where i control everything, so i don't find the synthesizer to be very
> intuitive :\

You get used to it, but it can be very frustrating at times. What we
really need is a completely Free toolchain so that we can improve on
it. :)

> If you ask me, the bigger problem is that the whole synthesis flow is
> "nonlinear", insofar as a small change in the input could result in a
> massive perturbation in the output. Xilinx's tools (and probably
> others) are deterministic insofar as identical input will always yield
> the same output (I'm fairly sure of this).

It is deterministic for the netlist. But sometimes when we're having
trouble meeting timing, we'll just run P&R again, and it'll give us
different results.

> > BTW, the only reason to do this is because without it, a multiply
> > would take at least 4 times longer due to the overhead of explicit
> > shifts and branches.

Do we care?
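For concreteness, the shift-and-branch overhead quoted above can be
sketched in Python. This is a back-of-the-envelope model, not the
nanocontroller's actual instruction set; the four-instructions-per-bit
cost is an illustrative assumption, not a measured figure.

```python
def shift_add_multiply(a, b, width=32):
    """Software multiply by shift-and-add, the way a controller with no
    hardware multiplier would do it.  The per-bit cost charged below
    (test+conditional add, two shifts, loop branch) is a rough
    illustrative model, not the actual nanocontroller ISA."""
    product = 0
    cycles = 0
    for _ in range(width):
        if b & 1:
            product += a
        a <<= 1           # shift multiplicand up
        b >>= 1           # shift multiplier down
        cycles += 4       # test+add, two shifts, branch (rough model)
    return product & ((1 << (2 * width)) - 1), cycles

# A 32x32 multiply costs about 128 instructions this way, versus a
# handful of cycles through a pipelined hardware multiplier.
prod, cycles = shift_add_multiply(1234, 5678)
```

Under this model the "at least 4 times longer" estimate looks
conservative for full 32-bit operands.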
> I would imagine Lattice's tools still infer multipliers and I believe
> there is a fast(ish) multiplier implementation for the Lattice XP
> architecture using LUTs and carry chains (it's alluded to in the datasheet).

Well, it won't be fast in one cycle, and if it's pipelined, it'll be a
lot more logic than we can probably justify.

> All of these implementation ideas seem to be very complex as compared to
> simply blocking the pipeline. This nanocontroller isn't going to be an
> OOO ILP-exploiting powerhouse anyhow. Why not just pipeline the
> multiplier enough to maintain the clock speed you're shooting for and
> let that determine the number of cycles that the multiply instruction
> blocks the ALU stage of the pipeline? This eliminates polluting the
> instruction set with odd instructions, is straightforward to implement,
> and won't slow down the instruction stream any more than any other
> multicycle multiplication would (plus code size would be negligibly
> smaller).

Blocking the pipeline could add a significant amount of overhead. I'm
not sure if Lattice registers have Enable inputs, or they may be
mutually exclusive with async resets. Adding blocking will increase
routing utilization and may also add an extra layer of MUXing to every
pipeline stage register. This cuts into our maximum clock rate.

> blocking condition occurs rarely, if at all. Or, you could simply
> eliminate the hazard-detection logic altogether and presume that the
> coder or toolchain is dealing with it.

If we were to expand the design to something more complex, we'd want
to do all this. The idea here is to have an _extremely_ simple
controller. For this design, if there are hazards, we'll let the
programmer work them out.

> Another approach might be to consider moving the nanocontroller to the
> Xilinx part to leverage the hard multipliers. What's the interface
> between the Spartan and the XP? Is it possible to leave the DMA on the
> XP and have the nanocontroller on the Spartan?
> Or to move both to the
> Spartan? How is the system partitioned between the two FPGAs?

All host-interface logic and the ROM memory are handled by the XP10;
everything else by the Xilinx. DMA would be controlled by a set of
queues managed by the controller and connected to, among other things,
the PCI master state machine. Feeding those queues across the bridge
could be problematic. On the other hand, the plan for VGA, which isn't
performance critical, was to have the controller pretend (from the
perspective of the Xilinx chip) to be the PCI target accessing the
graphics memory, so we'd be reusing an existing interface.

> What kind of code will the nanocontroller be running? Who'll be writing
> code for it? Will it be written in a high-level language or strictly in
> assembly? What kind of throughput is required (or, in other words,
> where does the 10ns constraint come from)? Is there any way to figure
> out how useful a 32x32 multiply is before worrying about how to
> implement it?

Someone may choose to develop a compiler, but it's probably not worth
it with only 512 words of program memory. The throughput has to exceed
PCIe 1x, which is roughly equivalent to 32-bit PCI at 66MHz. As the
DMA scheduler, its job is to queue up memory requests. For instance,
if it were copying from the graphics memory to the host, it would make
an interleaved sequence of requests (each request would be for a block
of up to 64 words, and multiple would be queued) of reads from the
memory system and writes to the host (via the master). Another thing
it would do is interpret command packets for the GPU, translating them
into register accesses.

> (Is all this documented somewhere? If so, please forgive my ignorance
> and pass on the URL.)

Well, there may be some bits and pieces here and there, but mostly
it's in the list archives.

> I'm going to try to get a feel for the cost of multipliers on the
> Lattice part. They haven't sent me a Synplify license yet, though.
Isn't all that free?

> Incidentally, sorry for coming out of nowhere with all this. I've been
> lurking on the list for a couple of months and just spotted somewhere I
> felt I could chime in helpfully. I'm a grad student at the University
> of Toronto interested in helping however I can. If I'm in breach of any
> etiquette, please let me know. Cheers,

You are most welcome here.

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
