Timothy Normand Miller wrote:
On 8/14/07, Farhan Mohamed Ali <[EMAIL PROTECTED]> wrote:
Speaking of multipliers, I was wondering what speed we are targeting for the
floatmult25, or for the FPGA in general? I have managed to get a 3-stage version
working at <9ns according to the tools (targeted for the 3S1500, but I'm not sure
how accurate the auto-generated timing constraints are). The flow seems a bit quirky
to me; it doesn't respond as I expect to the changes I make (as in, why do changes
I make in stage 1 affect the critical path, which is in stage 2? And weird stuff
like that). I'm used to working on full-custom ASICs where I control everything,
so I don't find the synthesizer very intuitive :\

It's trying to do the P&R automatically, and it's using simulated
annealing to do it.  It's an optimization problem using randomization.
So to begin with, what you get isn't deterministic.  But since
there's competition for resources, changing something in one place
will affect everything else.  It can be frustrating sometimes.  We
find that when we're on the edge of being able to meet timing, we'll
have to run P&R several times before it gives us what we want.

In addition to the nondeterminism in simulated annealing (if indeed Xilinx uses simulated annealing for placement, which isn't necessarily the case), XST probably does at least some retiming on your circuit (depending on your synthesis script), which could move critical paths between stages. There's tech mapping, packing, and routing, too -- all of which are heuristic and nonlinear.

If you ask me, the bigger problem is that the whole synthesis flow is "nonlinear", insofar as a small change in the input could result in a massive perturbation in the output. Xilinx's tools (and probably others) are deterministic insofar as identical input will always yield the same output (I'm fairly sure of this). However, a tiny perturbation in the input (RTL, constraints...) can completely change the final implementation.

You can work around the unpredictability in critical portions of the design by explicitly instantiating LUTs, carry chains, etc. and locking them to specific locations on the part (again, at least with Xilinx). You can use FPGA Editor to rewire a post-PAR design. You don't _have_ to use synthesis at all if the control/performance/effort/maintainability/portability trade-off of doing full-custom designs looks right.

I think that unless you impose external timing constraints, Xilinx's tools try for the lowest possible delay first and minimize area as a secondary goal (possibly with power optimizations as a tertiary goal in recent releases). I'd worry less about constraints and more about synthesizing for the right part if you're looking for a reasonable timing estimate. I'm assuming "the right part" means XC3S4000-fg676-5 or LFXP10C-5F256C based on the svn BOM, datasheets, and this discussion.

Back to the XP10: if it doesn't have hard multipliers, we can make our own :)
But again, I'm not sure how well that works out for FPGAs.


Yeah.  It's not worth the extra logic to fully pipeline it (nor could
we keep a 32-stage multiplier pipeline fully fed).  We could have
separate logic that would run in parallel.  Or we could have special
mult-stepping instructions.  With the latter, partial multiplies can
be optimized to take fewer cycles.

Here's what I'm thinking....

If we had a stand-alone multstep instruction, it would need four
operands:  (1) an accumulator, (2) the multiplicand, (3) one bit from
the multiplier to determine whether or not the multiplicand is added
to the accumulator, and (4) a loop counter from which to compute the
multiplicand left-shift and which bit to take from the multiplier.
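As a sanity check on the semantics, here's a minimal Python model of that four-operand step (all names are mine, not a proposed encoding):

```python
def multstep(acc, multiplicand, multiplier_bit, counter):
    """One shift-add step: conditionally add the left-shifted
    multiplicand into the accumulator.  The counter doubles as
    the shift amount and the multiplier bit index."""
    if multiplier_bit:
        acc += multiplicand << counter
    return acc

def multiply(a, b, width=32):
    """A full multiply built from `width` multstep operations."""
    acc = 0
    for i in range(width):
        acc = multstep(acc, a, (b >> i) & 1, i)
    return acc
```

So multiply(3, 5) walks the three set bits of 5 and accumulates 3, 0, and 12, giving 15.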

Now, I don't like the idea of adding extra state.  What if we want to
add the ability to handle interrupts?  But we can tinker with the
idea:  Have one special instruction whose job is to load the counter
and the multiplier.  The step instruction would have the accumulator
(as a source and the target) and the multiplicand.  Each step would
step the counter, shift the multiplier, and add (or not, depending on
the bit from the multiplier) the shifted multiplicand to the
accumulator.  That puts a shifter in line with an adder, though, so
maybe we want to load the multiplier and multiplicand in the first
instruction (so they're shifted by 1 each cycle) and then specify the
counter and accumulator in the step?  We'll have to work out the
permutations.
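To make that trade-off concrete, here's a sketch (in Python, with hypothetical names like 'mulinit') of the second variant, where the shifting happens as state bookkeeping between steps rather than in series with the adder:

```python
class MulState:
    """State loaded by a hypothetical 'mulinit' instruction:
    both operands plus the loop counter."""
    def __init__(self, multiplicand, multiplier, width=32):
        self.mcand = multiplicand   # shifted left once per step
        self.mplier = multiplier    # shifted right once per step
        self.count = width

def multstep(st, acc):
    """The step itself is only a conditional add; the shifts
    update state for the *next* step, so no shifter sits in
    series with the adder's critical path."""
    if st.mplier & 1:
        acc += st.mcand
    st.mcand <<= 1
    st.mplier >>= 1
    st.count -= 1
    return acc

# 7 * 6 via 32 steps:
st = MulState(7, 6)
acc = 0
while st.count:
    acc = multstep(st, acc)
```

After the loop, acc holds 42. The cost of this variant is the extra architectural state (mcand, mplier, count), which is exactly the interrupt-handling concern above.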

BTW, the only reason to do this is that without it, a multiply
would take at least 4 times longer due to the overhead of explicit
shifts and branches.  Do we care?
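For what it's worth, the "4 times" figure matches a naive software loop: each multiplier bit costs roughly a bit-test-and-branch, a conditional add, two shifts, and a loop decrement-and-branch, versus one multstep. A rough tally under those assumptions:

```python
def sw_multiply(a, b, width=32):
    """Software shift-add multiply, written to mirror the per-bit
    instruction sequence.  Returns (product, rough dynamic op count);
    the op costs are my estimates, not measurements."""
    acc, ops = 0, 0
    for _ in range(width):
        ops += 1                 # test low multiplier bit + branch
        if b & 1:
            acc += a             # conditional add
            ops += 1
        a <<= 1                  # shift multiplicand
        b >>= 1                  # shift multiplier
        ops += 2
        ops += 1                 # decrement counter + loop branch
    return acc, ops
```

That comes out between 4 and 5 ops per bit, versus 1 per bit with a dedicated step instruction.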


I would imagine Lattice's tools still infer multipliers, and I believe there is a fast(ish) multiplier implementation for the Lattice XP architecture using LUTs and carry chains (it's alluded to in the datasheet).

A fixed one-bit shift after the adder should be so cheap as to make no odds. It's just a two-input MUX -- it'll fit in one LUT, possibly packed with something else.

All of these implementation ideas seem to be very complex as compared to simply blocking the pipeline. This nanocontroller isn't going to be an OOO ILP-exploiting powerhouse anyhow. Why not just pipeline the multiplier enough to maintain the clock speed you're shooting for and let that determine the number of cycles that the multiply instruction blocks the ALU stage of the pipeline? This eliminates polluting the instruction set with odd instructions, is straightforward to implement, and won't slow down the instruction stream any more than any other multicycle multiplication would (plus code size would be negligibly smaller).
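A toy cycle model of the fully-blocking option (the latency number is invented, purely for illustration):

```python
def blocking_schedule(instrs, mul_cycles=4):
    """Cycle in which each instruction leaves the execute stage,
    assuming a multiply simply occupies the (single) ALU stage for
    `mul_cycles` cycles and everything else takes one."""
    cycle, done = 0, []
    for op in instrs:
        cycle += mul_cycles if op == "mul" else 1
        done.append(cycle)
    return done
```

With mul_cycles=4, the sequence add, mul, add completes at cycles 1, 5, and 6: everything behind the multiply eats the full latency whether it needs the result or not.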

A slightly more complex solution that could significantly pick up ILP would be to let the pipeline proceed and block only when the multiplication result is used (no avoiding that) or another multiplication is issued (assuming the multiplier is a simple serial implementation and not fully pipelined). Whoever's writing assembly (or the assembler itself) could try to schedule multiplications so that the blocking condition occurs rarely, if at all. Or, you could simply eliminate the hazard-detection logic altogether and presume that the coder or toolchain is dealing with it.
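And a matching toy model of the block-only-on-use variant: a multiply issues in one cycle and completes in the background, and a later instruction stalls only if it reads the pending result or is itself another multiply (opcode/operand format invented, just to show the hazard rule):

```python
def lazy_schedule(instrs, mul_cycles=4):
    """instrs: list of (op, dests, srcs) tuples using register names.
    Returns the issue cycle of each instruction."""
    cycle = 0
    mul_ready = 0       # cycle when the in-flight mul result is valid
    mul_dest = None     # register the in-flight mul writes, if any
    issued = []
    for op, dests, srcs in instrs:
        cycle += 1
        # Stall if we read the pending mul result, or try to issue
        # another mul while the serial multiplier is busy.
        if mul_dest is not None and (mul_dest in srcs or op == "mul"):
            cycle = max(cycle, mul_ready)
        if op == "mul":
            mul_dest = dests[0]
            mul_ready = cycle + mul_cycles
        issued.append(cycle)
    return issued
```

Here a mul into r1 at cycle 1, an independent add at cycle 2, and an add reading r1 stalling until cycle 5 shows the win: independent work fills the multiply's shadow, and good scheduling can hide it entirely.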

It should still be feasible to implement an early-exit for narrow multiplications.
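A sketch of that early exit for a serial implementation: stop as soon as the remaining multiplier bits are all zero, so narrow operands finish in few cycles (my Python, not any agreed-upon design):

```python
def serial_mul_early_exit(a, b):
    """Serial shift-add multiply that retires as soon as no
    multiplier bits remain.  Returns (product, cycles_taken)."""
    acc, cycles = 0, 0
    while b:
        if b & 1:
            acc += a
        a <<= 1
        b >>= 1
        cycles += 1
    return acc, cycles
```

So a 32-bit value times a small constant like 6 takes 3 cycles instead of 32; the hardware equivalent is a zero-detect on the remaining multiplier bits.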

Another approach might be to consider moving the nanocontroller to the Xilinx part to leverage the hard multipliers. What's the interface between the Spartan and the XP? Is it possible to leave the DMA on the XP and have the nanocontroller on the Spartan? Or to move both to the Spartan? How is the system partitioned between the two FPGAs?

What kind of code will the nanocontroller be running? Who'll be writing code for it? Will it be written in a high-level language or strictly in assembly? What kind of throughput is required (or, in other words, where does the 10ns constraint come from)? Is there any way to figure out how useful a 32x32 multiply is before worrying about how to implement it?

(Is all this documented somewhere? If so, please forgive my ignorance and pass on the URL.)

I'm going to try to get a feel for the cost of multipliers on the Lattice part. They haven't sent me a Synplify license yet, though.

Incidentally, sorry for coming out of nowhere with all this. I've been lurking on the list for a couple of months and just spotted somewhere I felt I could chime in helpfully. I'm a grad student at the University of Toronto interested in helping however I can. If I'm in breach of any etiquette, please let me know. Cheers,

Mark.
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
