Timothy Normand Miller wrote:
On 8/14/07, Farhan Mohamed Ali <[EMAIL PROTECTED]> wrote:
Speaking of multipliers, I was wondering what speed we are targeting for the
floatmult25, or for the FPGA in general? I have managed to get a 3-stage version
working at <9ns according to the tools (targeted for the 3S1500, but I'm not sure
how accurate the auto-generated timing constraints are). The flow seems a bit quirky
to me; it doesn't respond as I expect to the changes I make (as in, why do changes
I make in stage 1 affect the critical path, which is in stage 2? And weird stuff
like that). I'm used to working on full-custom ASICs where I control everything,
so I don't find the synthesizer very intuitive :\

It's trying to do the P&R automatically, and it's using simulated
annealing to do it.  It's an optimization problem using randomization.
So to begin with, what you get isn't deterministic.  But since
there's competition for resources, changing something in one place
will affect everything else.  It can be frustrating sometimes.  We
find that when we're on the edge of being able to meet timing, we'll
have to run P&R several times before it gives us what we want.

In addition to the nondeterminism in simulated annealing (if indeed Xilinx uses simulated annealing for placement, which isn't necessarily the case), XST probably does at least some retiming on your circuit (depending on your synthesis script), which could move critical paths between stages. There's tech mapping, packing, and routing, too -- all of which are heuristic and nonlinear.

If you ask me, the bigger problem is that the whole synthesis flow is "nonlinear", insofar as a small change in the input could result in a massive perturbation in the output. Xilinx's tools (and probably others) are deterministic insofar as identical input will always yield the same output (I'm fairly sure of this). However, a tiny perturbation in the input (RTL, constraints...) can completely change the final implementation.

You can work around the unpredictability in critical portions of the design by explicitly instantiating LUTs, carry chains, etc. and locking them to specific locations on the part (again, at least with Xilinx). You can use FPGA Editor to rewire a post-PAR design. You don't _have_ to use synthesis at all if the control/performance/effort/maintainability/portability trade-off of doing full-custom designs looks right.

I think that unless you impose external timing constraints, Xilinx's tools try for the lowest possible delay first and minimize area as a secondary goal (possibly with power optimizations as a tertiary goal in recent releases). I'd worry less about constraints and more about synthesizing for the right part if you're looking for a reasonable timing estimate. I'm assuming "the right part" means XC3S4000-fg676-5 or LFXP10C-5F256C based on the svn BOM, datasheets, and this discussion.

Back to the XP10: if it doesn't have hard multipliers, we can make our own :)
But again, I'm not sure how well that works out for FPGAs.


Yeah.  It's not worth the extra logic to fully pipeline it (nor could
we keep a 32-stage multiplier pipeline fully fed).  We could have
separate logic that would run in parallel.  Or we could have special
mult-stepping instructions.  With the latter, partial multiplies can
be optimized to take fewer cycles.

Here's what I'm thinking....

If we had a stand-alone multstep instruction, it would need four
operands:  (1) an accumulator, (2) the multiplicand, (3) one bit from
the multiplier to determine whether or not the multiplicand is added
to the accumulator, and (4) a loop counter from which to compute the
multiplicand left-shift and which bit to take from the multiplier.
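As a sanity check on the semantics, here's a minimal Python model of that four-operand step (all names are mine, not a proposed encoding):

```python
def multstep(acc, multiplicand, multiplier_bit, counter):
    """One shift-add step: conditionally add the left-shifted
    multiplicand into the accumulator.  The counter doubles as
    the shift amount and the multiplier bit index."""
    if multiplier_bit:
        acc += multiplicand << counter
    return acc

def multiply(a, b, width=32):
    """A full multiply built from `width` multstep operations."""
    acc = 0
    for i in range(width):
        acc = multstep(acc, a, (b >> i) & 1, i)
    return acc
```

So multiply(3, 5) walks the three set bits of 5 and accumulates 3, 0, and 12, giving 15.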

Now, I don't like the idea of adding extra state.  What if we want to
add the ability to handle interrupts?  But we can tinker with the
idea:  Have one special instruction whose job is to load the counter
and the multiplier.  The step instruction would have the accumulator
(as a source and the target) and the multiplicand.  Each step would
step the counter, shift the multiplier, and add (or not, depending on
the bit from the multiplier) the shifted multiplicand to the
accumulator.  That puts a shifter in line with an adder, though, so
maybe we want to load the multiplier and multiplicand in the first
instruction (so they're shifted by 1 each cycle) and then specify the
counter and accumulator in the step?  We'll have to work out the
permutations.
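To make that trade-off concrete, here's a sketch (in Python, with hypothetical names like 'mulinit') of the second variant, where the shifting happens as state bookkeeping between steps rather than in series with the adder:

```python
class MulState:
    """State loaded by a hypothetical 'mulinit' instruction:
    both operands plus the loop counter."""
    def __init__(self, multiplicand, multiplier, width=32):
        self.mcand = multiplicand   # shifted left once per step
        self.mplier = multiplier    # shifted right once per step
        self.count = width

def multstep(st, acc):
    """The step itself is only a conditional add; the shifts
    update state for the *next* step, so no shifter sits in
    series with the adder's critical path."""
    if st.mplier & 1:
        acc += st.mcand
    st.mcand <<= 1
    st.mplier >>= 1
    st.count -= 1
    return acc

# 7 * 6 via 32 steps:
st = MulState(7, 6)
acc = 0
while st.count:
    acc = multstep(st, acc)
```

After the loop, acc holds 42. The cost of this variant is the extra architectural state (mcand, mplier, count), which is exactly the interrupt-handling concern above.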

BTW, the only reason to do this is that without it, a multiply
would take at least 4 times longer due to the overhead of explicit
shifts and branches.  Do we care?
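For what it's worth, the "4 times" figure matches a naive software loop: each multiplier bit costs roughly a bit-test-and-branch, a conditional add, two shifts, and a loop decrement-and-branch, versus one multstep. A rough tally under those assumptions:

```python
def sw_multiply(a, b, width=32):
    """Software shift-add multiply, written to mirror the per-bit
    instruction sequence.  Returns (product, rough dynamic op count);
    the op costs are my estimates, not measurements."""
    acc, ops = 0, 0
    for _ in range(width):
        ops += 1                 # test low multiplier bit + branch
        if b & 1:
            acc += a             # conditional add
            ops += 1
        a <<= 1                  # shift multiplicand
        b >>= 1                  # shift multiplier
        ops += 2
        ops += 1                 # decrement counter + loop branch
    return acc, ops
```

That comes out between 4 and 5 ops per bit, versus 1 per bit with a dedicated step instruction.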


I would imagine Lattice's tools still infer multipliers, and I believe there is a fast(ish) multiplier implementation for the Lattice XP architecture using LUTs and carry chains (it's alluded to in the datasheet).

A fixed one-bit shift after the adder should be so cheap as to make no odds. It's just a two-input MUX -- it'll fit in one LUT, possibly packed with something else.

All of these implementation ideas seem to be very complex as compared to simply blocking the pipeline. This nanocontroller isn't going to be an OOO ILP-exploiting powerhouse anyhow. Why not just pipeline the multiplier enough to maintain the clock speed you're shooting for and let that determine the number of cycles that the multiply instruction blocks the ALU stage of the pipeline? This eliminates polluting the instruction set with odd instructions, is straightforward to implement, and won't slow down the instruction stream any more than any other multicycle multiplication would (plus code size would be negligibly smaller).
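A toy cycle model of the fully-blocking option (the latency number is invented, purely for illustration):

```python
def blocking_schedule(instrs, mul_cycles=4):
    """Cycle in which each instruction leaves the execute stage,
    assuming a multiply simply occupies the (single) ALU stage for
    `mul_cycles` cycles and everything else takes one."""
    cycle, done = 0, []
    for op in instrs:
        cycle += mul_cycles if op == "mul" else 1
        done.append(cycle)
    return done
```

With mul_cycles=4, the sequence add, mul, add completes at cycles 1, 5, and 6: everything behind the multiply eats the full latency whether it needs the result or not.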

A slightly more complex solution that could significantly pick up ILP would be to let the pipeline proceed and block only when the multiplication result is used (no avoiding that) or another multiplication is issued (assuming the multiplier is a simple serial implementation and not fully pipelined). Whoever's writing assembly (or the assembler itself) could try to schedule multiplications so that the blocking condition occurs rarely, if at all. Or, you could simply eliminate the hazard-detection logic altogether and presume that the coder or toolchain is dealing with it.
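And a matching toy model of the block-only-on-use variant: a multiply issues in one cycle and completes in the background, and a later instruction stalls only if it reads the pending result or is itself another multiply (opcode/operand format invented, just to show the hazard rule):

```python
def lazy_schedule(instrs, mul_cycles=4):
    """instrs: list of (op, dests, srcs) tuples using register names.
    Returns the issue cycle of each instruction."""
    cycle = 0
    mul_ready = 0       # cycle when the in-flight mul result is valid
    mul_dest = None     # register the in-flight mul writes, if any
    issued = []
    for op, dests, srcs in instrs:
        cycle += 1
        # Stall if we read the pending mul result, or try to issue
        # another mul while the serial multiplier is busy.
        if mul_dest is not None and (mul_dest in srcs or op == "mul"):
            cycle = max(cycle, mul_ready)
        if op == "mul":
            mul_dest = dests[0]
            mul_ready = cycle + mul_cycles
        issued.append(cycle)
    return issued
```

Here a mul into r1 at cycle 1, an independent add at cycle 2, and an add reading r1 stalling until cycle 5 shows the win: independent work fills the multiply's shadow, and good scheduling can hide it entirely.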

It should still be feasible to implement an early-exit for narrow multiplications.
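A sketch of that early exit for a serial implementation: stop as soon as the remaining multiplier bits are all zero, so narrow operands finish in few cycles (my Python, not any agreed-upon design):

```python
def serial_mul_early_exit(a, b):
    """Serial shift-add multiply that retires as soon as no
    multiplier bits remain.  Returns (product, cycles_taken)."""
    acc, cycles = 0, 0
    while b:
        if b & 1:
            acc += a
        a <<= 1
        b >>= 1
        cycles += 1
    return acc, cycles
```

So a 32-bit value times a small constant like 6 takes 3 cycles instead of 32; the hardware equivalent is a zero-detect on the remaining multiplier bits.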

Another approach might be to consider moving the nanocontroller to the Xilinx part to leverage the hard multipliers. What's the interface between the Spartan and the XP? Is it possible to leave the DMA on the XP and have the nanocontroller on the Spartan? Or to move both to the Spartan? How is the system partitioned between the two FPGAs?

What kind of code will the nanocontroller be running? Who'll be writing code for it? Will it be written in a high-level language or strictly in assembly? What kind of throughput is required (or, in other words, where does the 10ns constraint come from)? Is there any way to figure out how useful a 32x32 multiply is before worrying about how to implement it?

(Is all this documented somewhere? If so, please forgive my ignorance and pass on the URL.)

I'm going to try to get a feel for the cost of multipliers on the Lattice part. They haven't sent me a Synplify license yet, though.

Incidentally, sorry for coming out of nowhere with all this. I've been lurking on the list for a couple of months and just spotted somewhere I felt I could chime in helpfully. I'm a grad student at the University of Toronto interested in helping however I can. If I'm in breach of any etiquette, please let me know. Cheers,

Mark.
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
