[casper] BEE2 hanging

2010-01-29 Thread John Ford
Hi all.

We're working hard on cleaning up our 800 MHz Coherent Dedispersion pulsar
machine for production.  We have it working with 8 GPU machines, and from
64 to 2048 coarse channels.

One problem we have is with our output FPGA, which rearranges the data
and ships it off simultaneously over four 10 GbE ports: sometimes sending an
arm() command (which tells the system to start on the next 1 PPS) locks up
communication with that FPGA.

The arm command (Python) just does two writes to the same register: first
a zero, then a one after sleeping for a second.
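
A minimal sketch of an arm() like the one described above. It assumes the
register is exposed as a file (as BORPH does for FPGA registers) and that it
takes one 32-bit big-endian word per write. The path in the usage comment,
the register name, and the helper write_reg() are all hypothetical; this is
not the actual GUPPI code.

```python
import os
import struct
import time


def write_reg(path, value):
    """Write one 32-bit word to a register file (assumed big-endian)."""
    # O_WRONLY without O_CREAT: a missing register file raises
    # FileNotFoundError instead of silently creating a regular file.
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, struct.pack(">I", value))
    finally:
        os.close(fd)


def arm(reg_path, delay=1.0):
    """Write 0, wait `delay` seconds, then write 1 to the same register."""
    write_reg(reg_path, 0)
    time.sleep(delay)
    write_reg(reg_path, 1)


# Hypothetical usage; the real path depends on the PID and register name:
# arm("/proc/<pid>/hw/ioreg/arm")
```

Opening the file for each write keeps every access a single short
open/write/close, which makes it easier to see which of the two writes hangs.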

If we kill the program that's trying to write to the FPGA, we can unload
the bof and reload it, and it starts working again.  Then it fails again
on an arm() some random number of calls later.

It seems to fail more often when we run the system at high speed.  Paul says
it doesn't fail at all with a 200 MHz ADC clock, as opposed to our usual
800 MHz.

Our previous design, for the regular GUPPI modes, does not do this.

Any ideas where to look for this?

Does trying to read or write a non-existent register make BORPH unhappy
enough to smite us?

Thanks for any insight.

John





Re: [casper] BEE2 hanging

2010-01-29 Thread Mark Wagner
Hi John,

Are you running this arm() command on the BEE2 or are you using a UDP or TCP
server?  Does it write the value in ASCII or binary mode?  BORPH has
occasionally acted strangely for us when we use ASCII mode, so we don't use
it anymore.
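
To make the distinction concrete, here is a small sketch of what the same
register value looks like on the wire in each form. The ASCII form shown is
simply what a shell `echo` would write; the binary form assumes a 32-bit
big-endian word. Both formats are assumptions for illustration and may not
match exactly what BORPH expects.

```python
import struct


def as_ascii(value):
    """Text form: the bytes a shell `echo <value>` would produce."""
    return f"{value}\n".encode("ascii")


def as_binary(value):
    """Raw form: one 32-bit unsigned word, big-endian (assumed byte order)."""
    return struct.pack(">I", value)


if __name__ == "__main__":
    # Show both encodings for the two values an arm() would send.
    for v in (0, 1):
        print(v, as_ascii(v), as_binary(v))
```

The text form varies in length with the value, while the binary form is
always four bytes.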

Mark







Re: [casper] BEE2 hanging

2010-01-29 Thread John Ford
 Hi John,

 Are you running this arm() command on the BEE2 or are you using a UDP or
 TCP server?

There is a server on the bee2 that receives the arm() command from a
client and then executes it locally on the control FPGA.

 Does it write the value in ASCII or binary mode?

Don't know.  Will find out.

 BORPH has occasionally acted strangely for us when we use ASCII mode, so we
 don't use it anymore.

Good to know this.  By the way, this is all with version 7.1.

Thanks.

John

