Re: [casper] Compiler merging SRLs -- Timing performance

2014-12-05 Thread Jack Hickish
Hi Everyone,

Thanks all for the advice. Based on a few experiments so far (skip to
4 for what I think is a disappointingly simple solution, that I was too
stupid to see in the XST manual) --

1. My fabric utilisation isn't that high, although digging around
planAhead there are some areas with high routing congestion. I wonder
how much this throws off the compiler.

2. Making the shift registers that are causing problems implement as
cores, rather than behavioural HDL, doesn't seem to solve the problem;
the tools will quite happily combine two such cores into one LUT.

3. Explicitly disabling SRLs (by putting lots of single delays in as
cores, adding resets, or (*shreg_extract = NO*)-ing the HDL code) makes
the problem go away for the individual delay (since now there aren't
any LUTs to combine). But mostly the symptom will just appear
somewhere else. (I haven't tried the nuclear global SRL disable, but
I'd be amazed if that didn't just cause my design to explode.)

4. Resynthesizing the netlists with -lc off seems to have made all
the issues I was having disappear. At least in the timing report I've
read, the headline spectacular fails have gone. Map reports that there
are still some SRLs using both O5 and O6 outputs, but I've got a bunch
of pcores, and I haven't resynth'd them all yet.
I'm a bit surprised that this option is needed in XST to avoid
combining LUTs that exist in different pcores. I would have thought
turning it off in map would be sufficient, but I guess I was wrong.
Maybe I would have figured this out sooner if I'd properly read an
up-to-date XST manual -- it appears that the default behaviour of LUT
combining in XST has gone from 'off' in Virtex 5 to 'auto' in Virtex
6.

So bottom line, maybe -lc is an option worth playing with in future if
designs are failing timing with bizarre signal paths.
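
For anyone scripting this, here is a minimal Python sketch of forcing -lc off in an XST script before resynthesizing. It assumes the script keeps one option per line; the helper name set_xst_option and the file name design.prj are made up for illustration, and how resynth_netlist itself passes options to XST is not covered here.

```python
def set_xst_option(script_text, name, value):
    """Return script_text with the XST option `-name` set to `value`.

    Assumes one option per line (an assumption about the script layout).
    If the option is not present it is appended at the end.
    """
    flag = "-" + name
    out = []
    found = False
    for line in script_text.splitlines():
        parts = line.split()
        if parts and parts[0] == flag:
            # Replace the existing setting, whatever its old value was.
            out.append(f"{flag} {value}")
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{flag} {value}")
    return "\n".join(out) + "\n"


# Hypothetical script contents; design.prj is a placeholder name.
example = "run\n-ifn design.prj\n-opt_mode Speed\n-lc auto\n"
print(set_xst_option(example, "lc", "off"))
```

The same helper could be pointed at shreg_extract or any other single-valued option, but each pcore netlist still has to be resynthesized for the change to take effect there.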

Thanks again for the help (and big shoutout for resynth_netlist, which
I certainly didn't realise was added by Dave 4 YEARS AGO!).

Jack

On 5 December 2014 at 07:01, Jason Manley jman...@ska.ac.za wrote:
 I often re-run XST with:

 register_balancing yes
 optimize_primitives yes
 read_cores yes
 shreg_extract no

 Setting shreg_extract to no prevents adjacent registers from being combined into SRL16s.

 Jason Manley
 CBF Manager
 SKA-SA

 Cell: +27 82 662 7726
 Work: +27 21 506 7300


Re: [casper] Compiler merging SRLs -- Timing performance

2014-12-04 Thread Mark Wagner
Hi Jack,

Not sure if this will help, but in Planahead I would try to click and drag
that LUT as close as possible to each of the outputs.  And if that doesn't
help or makes it worse, you could also try to duplicate the logic going to
each of those outputs, forcing separate LUTs to be used.

Cheers,
Mark


On Thu, Dec 4, 2014 at 10:48 AM, Jack Hickish jackhick...@gmail.com wrote:

 Hi all,

 This is something I've been fighting with for a while now, and I wonder if
 anyone on this mailing list has any insight (because I'm pretty sure I may just
 be doing something wrong with the tools).

 The problem:
 I'm playing with a ROACH2 design that (sometimes) compiles at 312 MHz.
 However, every now and then I'll make a small change to the design and the
 compile will fail timing catastrophically, with paths failing sometimes
 with -2 ns (or worse) slack.
 When I look at the failing path(s), the delays are usually ~80% routing.
 I'll see a signal take a huge detour to use a shift register in some
 arbitrary location on the chip. Upon closer inspection of the relevant SRL,
 it appears that the LUT concerned is being used for two signal paths, one
 on the O5 output, one on the O6. The result seems to be that it is poorly
 placed for both its roles.
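
As a rough check of the numbers quoted above, here is a small Python sketch of the timing budget. It assumes the 312 MHz target, the -2 ns slack and the ~80% routing share mentioned in the message, and it ignores clock skew and uncertainty, which a real timing report would include. The function names are made up for illustration.

```python
def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a clock frequency given in MHz."""
    return 1000.0 / freq_mhz


def path_delay_ns(freq_mhz, slack_ns):
    """Data path delay implied by a slack figure (negative slack = failing).

    Simplification: delay = period - slack, with no skew or uncertainty terms.
    """
    return clock_period_ns(freq_mhz) - slack_ns


period = clock_period_ns(312.0)        # roughly 3.2 ns
delay = path_delay_ns(312.0, -2.0)     # roughly 5.2 ns
routing = 0.8 * delay                  # roughly 4.2 ns if ~80% is routing
print(f"period {period:.2f} ns, failing path {delay:.2f} ns, routing {routing:.2f} ns")
```

Under these assumptions the routing portion alone is longer than the whole clock period, which is why no amount of logic optimisation would rescue such a path; the placement of the shared LUT has to change.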

 I'm only using ~50% of the slices and about 30% of the registers / LUTs on
 the FPGA, and there are plenty of sensibly located SLICEMs the placer could
 use if it so desired. I've switched LUT combining off (with the -lt flag)
 in planAhead, which doesn't seem to have made any difference.

 Can anyone offer me any words of advice / wisdom which might reduce my
 confusion at what's going on (or, even better, help me solve the problem)?

 Despairingly yours,
 Jack





Re: [casper] Compiler merging SRLs -- Timing performance

2014-12-04 Thread David MacMahon
Hi, Jack,

Are the tools optimizing for area instead of speed?  Are you using Pblocks?

I don't know if this is relevant to your situation, but I've run into 
annoyances when the tools use equivalent register removal to save a few 
flip-flops but end up causing fan-out/routing issues.  That can be turned off, 
but it's a synthesis option so if you want to apply it to a System Generator 
netlist, you have to use the resynth_netlist Matlab function from the casper 
library to re-synthesize the entire netlist.

Dave

 
 




Re: [casper] Compiler merging SRLs -- Timing performance

2014-12-04 Thread Jack Hickish
Hey Mark,

Yeah, I guess I could manually force the locations of the two offending
shift-regs to stop the combination, but the problem SRLs seem to be a
fairly arbitrary selection of those in the design. I don't really want to
have to start constraining at the LUT level if I can help it. But maybe
I'll try and see if the problem goes away, or just emerges somewhere else.

Hi Dave,

I have been through all the planAhead options, as well as the
fast_runtime.opt settings in the base package (I've been using both flows)
and (tried to) set everything to optimize for speed. The -lt option to me
seems like it should control the behaviour I'm seeing, but it doesn't seem
to. I'm using pblocks, but have almost exclusively been constraining
only rams/dsps. As above, I'm about to try forcing the placements. I
haven't run resynth netlist on my simulink design, but equivalent register
removal is turned off in planAhead and some of the signals it appears to be
LUT-combining belong to different pcores, so I thought that planahead
settings should be enough. (obviously I could be wrong).
In any case, I didn't think this was an equivalent register removal
problem. It's not like multiple copies of the same register are being
merged at the expense of fanout, just a 2-clock data delay inside an
X-engine might be merged with a 2-clock delay of some data signal in an
FFT. But again, maybe I'm understanding the options wrong, so I'll try
resynthing the netlist and see if that helps.

Thanks for your help, both.

Jack







Re: [casper] Compiler merging SRLs -- Timing performance

2014-12-04 Thread Henno Kriel
Hi Jack,

In Simulink I have seen similar issues when trying to add more register
pipelining to decrease routing delays and thus increase Fmax.
However, ISE just collapses all the pipelining into a single SRL, which
yields the frustrations you mentioned.
You can prevent this from happening by adding a synchronous reset (one of
the tick boxes on the delay block) to your pipelining registers.
You will have to connect up a reset signal from a register (but you don't
actually need to use it),
to ensure that it does not get optimized away.
In my case this normally resolves the routing issue and achieves timing
closure.

Hope this helps.
HK







-- 
Kind regards,
Henno Kriel

DBE: Hardware Manager

SKA South Africa
Third Floor
The Park
Park Road (off Alexandra Road)
Pinelands
7405
Western Cape
South Africa

Latitude: -33.94329 (South); Longitude: 18.48945 (East).

(p) +27 (0)21 506 7300
(p) +27 (0)21 506 7374 (direct)
(f) +27 (0)21 506 7375
(m) +27 (0)84 504 5050


Re: [casper] Compiler merging SRLs -- Timing performance

2014-12-04 Thread Jason Manley
I often re-run XST with:

register_balancing yes
optimize_primitives yes
read_cores yes
shreg_extract no

Setting shreg_extract to no prevents adjacent registers from being combined into SRL16s.
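
The four settings above can be kept in one place and turned into XST-style "-name value" option lines. This is only a sketch: the names RESYNTH_OPTIONS and xst_option_lines are hypothetical, and where the lines end up (an .xst script, or arguments to a resynthesis helper) depends on the flow in use.

```python
# The option names and values are the ones listed in the message above.
RESYNTH_OPTIONS = {
    "register_balancing": "yes",
    "optimize_primitives": "yes",
    "read_cores": "yes",
    "shreg_extract": "no",
}


def xst_option_lines(options):
    """Format a mapping of option name -> value as '-name value' lines.

    Relies on dicts preserving insertion order (Python 3.7 and later).
    """
    return [f"-{name} {value}" for name, value in options.items()]


for line in xst_option_lines(RESYNTH_OPTIONS):
    print(line)
```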

Jason Manley
CBF Manager
SKA-SA

Cell: +27 82 662 7726
Work: +27 21 506 7300
