On Tue, August 14, 2007 12:20 pm, Mark said:
> Timothy Normand Miller wrote:
>> On 8/14/07, Farhan Mohamed Ali <[EMAIL PROTECTED]> wrote:
>>> Speaking of multipliers, I was wondering what speed we are targeting
>>> for the floatmult25 or the entire FPGA in general? I have managed to
>>> get a 3-stage version working at <9ns according to the tools
>>> (targeted for 3S1500), but I'm not sure how accurate the
>>> auto-generated timing constraints are. Seems a bit quirky to me; it
>>> does not respond as I expect to the changes I make (as in, why do
>>> changes I make in stage 1 affect the critical path, which is in
>>> stage 2, and weird stuff like that). I'm used to working on
>>> full-custom ASICs where I control everything, so I don't find the
>>> synthesizer to be very intuitive. :\
>>>
>> It's trying to do the P&R automatically, and it's using simulated
>> annealing to do it. It's an optimization problem using randomization,
>> so to begin with, what you get isn't deterministic. But since there's
>> competition for resources, changing something in one place will
>> affect everything else. It can be frustrating sometimes. We find that
>> when we're on the edge of being able to meet timing, we'll have to
>> run P&R several times before it gives us what we want.
>>
> In addition to the nondeterminism in simulated annealing (if indeed
> Xilinx uses simulated annealing for placement, which isn't necessarily
> the case), XST probably does at least some retiming on your circuit
> (depending on your synthesis script), which could move critical paths
> between stages. There's tech mapping, packing, and routing, too -- all
> of which are heuristic and nonlinear.
>
> If you ask me, the bigger problem is that the whole synthesis flow is
> "nonlinear", insofar as a small change in the input can result in a
> massive perturbation in the output.
> Xilinx's tools (and probably others) are deterministic insofar as
> identical input will always yield the same output (I'm fairly sure of
> this). However, a tiny perturbation in the input (RTL, constraints...)
> can completely change the final implementation.
>
> You can work around the unpredictability in critical portions of the
> design by explicitly instantiating LUTs, carry chains, etc. and
> locking them to specific locations on the part (again, at least with
> Xilinx). You can use FPGA Editor to rewire a post-PAR design. You
> don't _have_ to use synthesis at all if the
> control/performance/effort/maintainability/portability trade-off of
> doing full-custom designs looks right.
>
> I think that unless you impose external timing constraints, Xilinx's
> tools try for the lowest possible delay first and minimize area as a
> secondary goal (possibly with power optimizations as a tertiary goal
> in recent releases). I'd worry less about constraints and more about
> synthesizing for the right part if you're looking for a reasonable
> timing estimate. I'm assuming "the right part" means XC3S4000-fg676-5
> or LFXP10C-5F256C based on the svn BOM, datasheets, and this
> discussion.

You are right, a fixed input generates a fixed output. It's just that
even small changes in the RTL change the synthesis results in
unpredictable ways, and I was wondering whether I was doing something
wrong, but apparently that's just the way it works. I did not set any
constraints, as I'm just learning to use the tools. I will be of more
help when it comes to the ASIC version, I hope.
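[Editor's note: for readers unfamiliar with why annealing-based placement behaves this way, here is a toy Python sketch -- emphatically not Xilinx's actual algorithm -- illustrating both points above: with a fixed seed the result is perfectly deterministic, yet a small change to the seed or netlist can land the placer in a completely different local minimum.]

```python
import math
import random

def anneal_placement(netlist, seed, grid=8, iters=20000):
    """Toy simulated-annealing placer: minimize total Manhattan wirelength.
    netlist is a list of (cell_a, cell_b) connections over cells 0..n-1.
    Returns (final_wirelength, placement)."""
    rng = random.Random(seed)              # fixed seed -> deterministic run
    n = 1 + max(max(e) for e in netlist)
    # Initial random placement: each cell gets a distinct (x, y) site.
    sites = [(x, y) for x in range(grid) for y in range(grid)]
    rng.shuffle(sites)
    pos = sites[:n]

    def cost():
        return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
                   for a, b in netlist)

    temp = 10.0
    cur = cost()
    for _ in range(iters):
        i, j = rng.randrange(n), rng.randrange(n)
        pos[i], pos[j] = pos[j], pos[i]        # propose swapping two cells
        new = cost()
        # Metropolis rule: always accept improvements; accept uphill
        # moves with probability exp(-delta / temperature).
        if new > cur and rng.random() >= math.exp(-(new - cur) / temp):
            pos[i], pos[j] = pos[j], pos[i]    # reject: undo the swap
        else:
            cur = new
        temp *= 0.9997                         # cool slowly
    return cur, pos
```

Running it twice with the same netlist and seed gives bit-identical placements; nudging either one typically gives a different layout of similar quality, which is the "nonlinear" behavior being discussed.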
>>> Back to the XP10: if it doesn't have hard multipliers, we can make
>>> our own. :) But again, I'm not sure how well that works out for
>>> FPGAs.
>>>
>> Yeah. It's not worth the extra logic to fully pipeline it (nor could
>> we keep a 32-stage multiplier pipeline fully fed). We could have
>> separate logic that would run in parallel. Or we could have special
>> mult-stepping instructions. With the latter, partial multiplies can
>> be optimized to take fewer cycles.
>>
>> Here's what I'm thinking....
>>
>> If we had a stand-alone multstep instruction, it would need four
>> operands: (1) an accumulator, (2) the multiplicand, (3) one bit from
>> the multiplier to determine whether or not the multiplicand is added
>> to the accumulator, and (4) a loop counter from which to compute the
>> multiplicand left-shift and which bit to take from the multiplier.
>>
>> Now, I don't like the idea of adding extra state. What if we want to
>> add the ability to handle interrupts? But we can tinker with the
>> idea: have one special instruction whose job is to load the counter
>> and the multiplier. The step instruction would have the accumulator
>> (as both a source and the target) and the multiplicand. Each step
>> would step the counter, shift the multiplier, and add (or not,
>> depending on the bit from the multiplier) the shifted multiplicand
>> to the accumulator. That puts a shifter in line with an adder,
>> though, so maybe we want to load the multiplier and multiplicand in
>> the first instruction (so they're shifted 1 each cycle) and then
>> specify the counter and accumulator in the step? We'll have to work
>> out the permutations.
>>
>> BTW, the only reason to do this is because without it, a multiply
>> would take at least 4 times longer due to the overhead of explicit
>> shifts and branches. Do we care?
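[Editor's note: Tim's second variant -- load the multiplier and counter once, then shift the multiplier right and the multiplicand left on every step so no barrel shifter sits in line with the adder -- can be modeled in software. This is a behavioral sketch; the instruction names (`mulstep_init`, `mulstep`) are hypothetical, not part of any proposed ISA.]

```python
def mulstep_init(multiplier, count=32):
    """Hypothetical 'load' instruction: latch the 32-bit multiplier and
    the loop counter into the multiply-step state."""
    return {"mplier": multiplier & 0xFFFFFFFF, "count": count}

def mulstep(state, acc, multiplicand):
    """Hypothetical one-bit multiply step.  If the current low multiplier
    bit is set, add the multiplicand into the accumulator; then shift the
    multiplier right and the multiplicand left for the next step."""
    if state["count"] == 0:
        return acc, multiplicand           # counter exhausted: no-op
    if state["mplier"] & 1:
        acc = (acc + multiplicand) & 0xFFFFFFFFFFFFFFFF  # 64-bit result
    state["mplier"] >>= 1                  # consume one multiplier bit
    state["count"] -= 1
    return acc, multiplicand << 1          # pre-shift for the next step

def multiply(a, b):
    """A 32x32 multiply as 32 explicit steps (an unrolled mulstep loop)."""
    st = mulstep_init(b)
    acc, mcand = 0, a
    for _ in range(32):
        acc, mcand = mulstep(st, acc, mcand)
    return acc
```

Because the shifts happen after the add rather than feeding it, each step is just an adder plus two fixed one-bit shifts, matching the trade-off discussed above.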
>>
>
> I would imagine Lattice's tools still infer multipliers, and I believe
> there is a fast(ish) multiplier implementation for the Lattice XP
> architecture using LUTs and carry chains (it's alluded to in the
> datasheet).
>
> A fixed one-bit shift after the adder should be so cheap as to make no
> odds. It's just a two-input MUX -- it'll fit in one LUT, possibly
> packed with something else.
>
> All of these implementation ideas seem very complex compared to simply
> blocking the pipeline. This nanocontroller isn't going to be an OOO
> ILP-exploiting powerhouse anyhow. Why not just pipeline the multiplier
> enough to maintain the clock speed you're shooting for and let that
> determine the number of cycles that the multiply instruction blocks
> the ALU stage of the pipeline? This avoids polluting the instruction
> set with odd instructions, is straightforward to implement, and won't
> slow down the instruction stream any more than any other multicycle
> multiplication would (plus code size would be negligibly smaller).
>
> A slightly more complex solution that could significantly pick up ILP
> would be to let the pipeline proceed and block only when the
> multiplication result is used (no avoiding that) or another
> multiplication is issued (assuming the multiplier is a simple serial
> implementation and not fully pipelined). Whoever's writing assembly
> (or the assembler itself) could try to schedule multiplications so
> that the blocking condition occurs rarely, if at all. Or, you could
> simply eliminate the hazard-detection logic altogether and presume
> that the coder or toolchain is dealing with it.
>
> It should still be feasible to implement an early exit for narrow
> multiplications.
>
> Another approach might be to consider moving the nanocontroller to the
> Xilinx part to leverage the hard multipliers. What's the interface
> between the Spartan and the XP?
> Is it possible to leave the DMA on the XP and have the nanocontroller
> on the Spartan? Or to move both to the Spartan? How is the system
> partitioned between the two FPGAs?
>
> What kind of code will the nanocontroller be running? Who'll be
> writing code for it? Will it be written in a high-level language or
> strictly in assembly? What kind of throughput is required (or, in
> other words, where does the 10ns constraint come from)? Is there any
> way to figure out how useful a 32x32 multiply is before worrying
> about how to implement it?
>
> (Is all this documented somewhere? If so, please forgive my ignorance
> and pass on the URL.)
>
> I'm going to try to get a feel for the cost of multipliers on the
> Lattice part. They haven't sent me a Synplify license yet, though.
>
> Incidentally, sorry for coming out of nowhere with all this. I've
> been lurking on the list for a couple of months and just spotted
> somewhere I felt I could chime in helpfully. I'm a grad student at
> the University of Toronto interested in helping however I can. If I'm
> in breach of any etiquette, please let me know. Cheers,
>
> Mark.
>
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
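[Editor's note: the serial multiplier Mark assumes in his blocking-only-on-use proposal, together with the early exit for narrow multiplications, can be sketched behaviorally. This is an illustrative model, not a proposed implementation; the cycle count it returns would be the number of cycles the multiply occupies the unit before a dependent instruction may read the result.]

```python
def serial_multiply(a, b, width=32):
    """Model of a simple serial (one bit per cycle) multiplier with an
    early exit: once the remaining multiplier bits are all zero, further
    cycles cannot change the result, so the unit finishes early.
    Returns (product, cycles_taken)."""
    acc = 0
    mcand = a
    mplier = b & ((1 << width) - 1)
    cycles = 0
    while mplier:                     # early exit when no set bits remain
        if mplier & 1:
            acc += mcand              # add the shifted multiplicand
        mcand <<= 1
        mplier >>= 1
        cycles += 1
    return acc, max(cycles, 1)        # even a multiply by 0 or 1 costs a cycle
```

Under this model a multiply by a small constant like 3 retires in 2 cycles rather than 32, which is what makes the early exit attractive when the hazard logic only stalls on an actual use of the result.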
