On Tue, August 14, 2007 12:20 pm, Mark said:
> Timothy Normand Miller wrote:
>> On 8/14/07, Farhan Mohamed Ali <[EMAIL PROTECTED]> wrote:
>>> Speaking of multipliers, I was wondering what speed we are targeting
>>> for the floatmult25 or the entire FPGA in general? I have managed to
>>> get a 3-stage version working at <9ns according to the tools
>>> (targeted for 3S1500), but I'm not sure how accurate the
>>> auto-generated timing constraints are. Seems a bit quirky to me; it
>>> does not respond as I expect to the changes I make (as in, why do
>>> changes I make in stage 1 affect the critical path, which is in
>>> stage 2, and weird stuff like that). I'm used to working on
>>> full-custom ASICs where I control everything, so I don't find the
>>> synthesizer to be very intuitive. :\
>>>
>> It's trying to do the P&R automatically, and it's using simulated
>> annealing to do it. It's an optimization problem using randomization,
>> so to begin with, what you get isn't deterministic. But since there's
>> competition for resources, changing something in one place will
>> affect everything else. It can be frustrating sometimes. We find that
>> when we're on the edge of being able to meet timing, we'll have to
>> run P&R several times before it gives us what we want.
>>
> In addition to the nondeterminism in simulated annealing (if indeed
> Xilinx uses simulated annealing for placement, which isn't necessarily
> the case), XST probably does at least some retiming on your circuit
> (depending on your synthesis script), which could move critical paths
> between stages. There's tech mapping, packing, and routing, too -- all
> of which are heuristic and nonlinear.
>
> If you ask me, the bigger problem is that the whole synthesis flow is
> "nonlinear", insofar as a small change in the input can result in a
> massive perturbation in the output.
> Xilinx's tools (and probably others) are deterministic insofar as
> identical input will always yield the same output (I'm fairly sure of
> this). However, a tiny perturbation in the input (RTL, constraints...)
> can completely change the final implementation.
>
> You can work around the unpredictability in critical portions of the
> design by explicitly instantiating LUTs, carry chains, etc. and
> locking them to specific locations on the part (again, at least with
> Xilinx). You can use FPGA Editor to rewire a post-PAR design. You
> don't _have_ to use synthesis at all if the
> control/performance/effort/maintainability/portability trade-off of
> doing full-custom designs looks right.
>
> I think that unless you impose external timing constraints, Xilinx's
> tools try for the lowest possible delay first and minimize area as a
> secondary goal (possibly with power optimizations as a tertiary goal
> in recent releases). I'd worry less about constraints and more about
> synthesizing for the right part if you're looking for a reasonable
> timing estimate. I'm assuming "the right part" means XC3S4000-fg676-5
> or LFXP10C-5F256C based on the svn BOM, datasheets, and this
> discussion.

You are right, a fixed input generates a fixed output. It's just that
even small changes in the RTL change the synthesis results in
unpredictable ways, and I was wondering whether I was doing something
wrong, but apparently that's just the way it works. I did not set any
constraints, as I'm just learning to use the tools. I will be of more
help when it comes to the ASIC version, I hope.
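[Editor's note: for readers unfamiliar with why annealing-based placement behaves this way, here is a toy Python sketch -- emphatically not Xilinx's actual algorithm -- illustrating both points above: with a fixed seed the result is perfectly deterministic, yet a small change to the seed or netlist can land the placer in a completely different local minimum.]

```python
import math
import random

def anneal_placement(netlist, seed, grid=8, iters=20000):
    """Toy simulated-annealing placer: minimize total Manhattan wirelength.
    netlist is a list of (cell_a, cell_b) connections over cells 0..n-1.
    Returns (final_wirelength, placement)."""
    rng = random.Random(seed)              # fixed seed -> deterministic run
    n = 1 + max(max(e) for e in netlist)
    # Initial random placement: each cell gets a distinct (x, y) site.
    sites = [(x, y) for x in range(grid) for y in range(grid)]
    rng.shuffle(sites)
    pos = sites[:n]

    def cost():
        return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
                   for a, b in netlist)

    temp = 10.0
    cur = cost()
    for _ in range(iters):
        i, j = rng.randrange(n), rng.randrange(n)
        pos[i], pos[j] = pos[j], pos[i]        # propose swapping two cells
        new = cost()
        # Metropolis rule: always accept improvements; accept uphill
        # moves with probability exp(-delta / temperature).
        if new > cur and rng.random() >= math.exp(-(new - cur) / temp):
            pos[i], pos[j] = pos[j], pos[i]    # reject: undo the swap
        else:
            cur = new
        temp *= 0.9997                         # cool slowly
    return cur, pos
```

Running it twice with the same netlist and seed gives bit-identical placements; nudging either one typically gives a different layout of similar quality, which is the "nonlinear" behavior being discussed.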
>>> Back to the XP10: if it doesn't have hard multipliers, we can make
>>> our own. :) But again, I'm not sure how well that works out for
>>> FPGAs.
>>>
>> Yeah. It's not worth the extra logic to fully pipeline it (nor could
>> we keep a 32-stage multiplier pipeline fully fed). We could have
>> separate logic that would run in parallel. Or we could have special
>> mult-stepping instructions. With the latter, partial multiplies can
>> be optimized to take fewer cycles.
>>
>> Here's what I'm thinking....
>>
>> If we had a stand-alone multstep instruction, it would need four
>> operands: (1) an accumulator, (2) the multiplicand, (3) one bit from
>> the multiplier to determine whether or not the multiplicand is added
>> to the accumulator, and (4) a loop counter from which to compute the
>> multiplicand left-shift and which bit to take from the multiplier.
>>
>> Now, I don't like the idea of adding extra state. What if we want to
>> add the ability to handle interrupts? But we can tinker with the
>> idea: have one special instruction whose job is to load the counter
>> and the multiplier. The step instruction would have the accumulator
>> (as both a source and the target) and the multiplicand. Each step
>> would step the counter, shift the multiplier, and add (or not,
>> depending on the bit from the multiplier) the shifted multiplicand
>> to the accumulator. That puts a shifter in line with an adder,
>> though, so maybe we want to load the multiplier and multiplicand in
>> the first instruction (so they're shifted 1 each cycle) and then
>> specify the counter and accumulator in the step? We'll have to work
>> out the permutations.
>>
>> BTW, the only reason to do this is because without it, a multiply
>> would take at least 4 times longer due to the overhead of explicit
>> shifts and branches. Do we care?
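[Editor's note: Tim's second variant -- load the multiplier and counter once, then shift the multiplier right and the multiplicand left on every step so no barrel shifter sits in line with the adder -- can be modeled in software. This is a behavioral sketch; the instruction names (`mulstep_init`, `mulstep`) are hypothetical, not part of any proposed ISA.]

```python
def mulstep_init(multiplier, count=32):
    """Hypothetical 'load' instruction: latch the 32-bit multiplier and
    the loop counter into the multiply-step state."""
    return {"mplier": multiplier & 0xFFFFFFFF, "count": count}

def mulstep(state, acc, multiplicand):
    """Hypothetical one-bit multiply step.  If the current low multiplier
    bit is set, add the multiplicand into the accumulator; then shift the
    multiplier right and the multiplicand left for the next step."""
    if state["count"] == 0:
        return acc, multiplicand           # counter exhausted: no-op
    if state["mplier"] & 1:
        acc = (acc + multiplicand) & 0xFFFFFFFFFFFFFFFF  # 64-bit result
    state["mplier"] >>= 1                  # consume one multiplier bit
    state["count"] -= 1
    return acc, multiplicand << 1          # pre-shift for the next step

def multiply(a, b):
    """A 32x32 multiply as 32 explicit steps (an unrolled mulstep loop)."""
    st = mulstep_init(b)
    acc, mcand = 0, a
    for _ in range(32):
        acc, mcand = mulstep(st, acc, mcand)
    return acc
```

Because the shifts happen after the add rather than feeding it, each step is just an adder plus two fixed one-bit shifts, matching the trade-off discussed above.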
>>
>
> I would imagine Lattice's tools still infer multipliers, and I believe
> there is a fast(ish) multiplier implementation for the Lattice XP
> architecture using LUTs and carry chains (it's alluded to in the
> datasheet).
>
> A fixed one-bit shift after the adder should be so cheap as to make no
> odds. It's just a two-input MUX -- it'll fit in one LUT, possibly
> packed with something else.
>
> All of these implementation ideas seem very complex compared to simply
> blocking the pipeline. This nanocontroller isn't going to be an OOO
> ILP-exploiting powerhouse anyhow. Why not just pipeline the multiplier
> enough to maintain the clock speed you're shooting for and let that
> determine the number of cycles that the multiply instruction blocks
> the ALU stage of the pipeline? This avoids polluting the instruction
> set with odd instructions, is straightforward to implement, and won't
> slow down the instruction stream any more than any other multicycle
> multiplication would (plus code size would be negligibly smaller).
>
> A slightly more complex solution that could significantly pick up ILP
> would be to let the pipeline proceed and block only when the
> multiplication result is used (no avoiding that) or another
> multiplication is issued (assuming the multiplier is a simple serial
> implementation and not fully pipelined). Whoever's writing assembly
> (or the assembler itself) could try to schedule multiplications so
> that the blocking condition occurs rarely, if at all. Or, you could
> simply eliminate the hazard-detection logic altogether and presume
> that the coder or toolchain is dealing with it.
>
> It should still be feasible to implement an early exit for narrow
> multiplications.
>
> Another approach might be to consider moving the nanocontroller to the
> Xilinx part to leverage the hard multipliers. What's the interface
> between the Spartan and the XP?
> Is it possible to leave the DMA on the XP and have the nanocontroller
> on the Spartan? Or to move both to the Spartan? How is the system
> partitioned between the two FPGAs?
>
> What kind of code will the nanocontroller be running? Who'll be
> writing code for it? Will it be written in a high-level language or
> strictly in assembly? What kind of throughput is required (or, in
> other words, where does the 10ns constraint come from)? Is there any
> way to figure out how useful a 32x32 multiply is before worrying
> about how to implement it?
>
> (Is all this documented somewhere? If so, please forgive my ignorance
> and pass on the URL.)
>
> I'm going to try to get a feel for the cost of multipliers on the
> Lattice part. They haven't sent me a Synplify license yet, though.
>
> Incidentally, sorry for coming out of nowhere with all this. I've
> been lurking on the list for a couple of months and just spotted
> somewhere I felt I could chime in helpfully. I'm a grad student at
> the University of Toronto interested in helping however I can. If I'm
> in breach of any etiquette, please let me know. Cheers,
>
> Mark.
>
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
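[Editor's note: the serial multiplier Mark assumes in his blocking-only-on-use proposal, together with the early exit for narrow multiplications, can be sketched behaviorally. This is an illustrative model, not a proposed implementation; the cycle count it returns would be the number of cycles the multiply occupies the unit before a dependent instruction may read the result.]

```python
def serial_multiply(a, b, width=32):
    """Model of a simple serial (one bit per cycle) multiplier with an
    early exit: once the remaining multiplier bits are all zero, further
    cycles cannot change the result, so the unit finishes early.
    Returns (product, cycles_taken)."""
    acc = 0
    mcand = a
    mplier = b & ((1 << width) - 1)
    cycles = 0
    while mplier:                     # early exit when no set bits remain
        if mplier & 1:
            acc += mcand              # add the shifted multiplicand
        mcand <<= 1
        mplier >>= 1
        cycles += 1
    return acc, max(cycles, 1)        # even a multiply by 0 or 1 costs a cycle
```

Under this model a multiply by a small constant like 3 retires in 2 cycles rather than 32, which is what makes the early exit attractive when the hazard logic only stalls on an actual use of the result.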
