[casper] BEE2 hanging
Hi all.

We're working hard on cleaning up our 800 MHz Coherent Dedispersion pulsar machine for production. We have it working with 8 GPU machines, and from 64 to 2048 coarse channels.

One problem we have is with our output FPGA, which rearranges the data and ships it off simultaneously over four 10 GbE ports. Sometimes sending an arm() command (which tells the system to start on the next 1 PPS) locks up communication with that FPGA. The arm command (Python) just does two writes to the same register: first a zero, then, after sleeping for a second, a one.

If we kill the program that's trying to write to the FPGA, we can unload the bof and reload it, and it starts working again. Then it fails again on an arm() some random number of attempts later. It seems to fail more often if we run the system at high speed. Paul says it doesn't fail at all with a 200 MHz ADC clock instead of our usual 800 MHz. Our previous design, the one for the regular GUPPI modes, does not do this.

Any ideas where to look for this? Does trying to read or write a non-existent register make BORPH unhappy enough to smite us?

Thanks for any insight.

John
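For readers following along, the arm sequence described above (a zero, a one-second pause, then a one, all to the same register) can be sketched roughly as below. This is not the actual GUPPI code: `arm`, `write_reg`, `delay_s` and `sleep` are hypothetical names, and the register write is passed in as a callable so the sketch makes no assumption about how the value reaches the FPGA.

```python
import time


def arm(write_reg, delay_s=1.0, sleep=time.sleep):
    """Arm the design: write 0, wait, then write 1 to the same register.

    write_reg is a hypothetical callable that takes an integer and
    performs the actual register write (for example via a BORPH
    register file). It stands in for whatever the real system uses.
    """
    write_reg(0)    # clear the arm bit
    sleep(delay_s)  # the post describes a one-second pause between writes
    write_reg(1)    # set the arm bit; the design starts on the next 1 PPS


if __name__ == "__main__":
    # Record the writes in a list instead of touching hardware.
    writes = []
    arm(writes.append, delay_s=0.0)
    print(writes)  # prints [0, 1]
```

Keeping the write behind a callable also makes it easy to log each write with a timestamp, which may help narrow down whether the hang happens on the first or the second write.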
Re: [casper] BEE2 hanging
Hi John,

Are you running this arm() command on the BEE2 or are you using a udp or tcp server? Does it write the value in ascii or binary mode? BORPH has occasionally acted strangely for us when we use ascii mode so we don't use it anymore.

Mark

(In reply to John Ford jf...@nrao.edu, Fri, Jan 29, 2010 at 1:23 PM.)
Re: [casper] BEE2 hanging
> Hi John, Are you running this arm() command on the BEE2 or are you using a udp or tcp server?

There is a server on the BEE2 that receives the arm() command from a client and then executes it locally on the control FPGA.

> Does it write the value in ascii or binary mode?

Don't know. Will find out.

> BORPH has occasionally acted strangely for us when we use ascii mode so we don't use it anymore.

Good to know this. By the way, this is all with version 7.1.

Thanks.

John
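To make the ascii-versus-binary question concrete, here is a minimal sketch of the difference in what gets written to a register file. Everything in it is an assumption for illustration: the helper names are hypothetical, the ascii form is taken to be decimal text, the binary form is taken to be a 32-bit big-endian word (the BEE2 control processor is a PowerPC), and `reg_path` is a placeholder rather than a real BORPH path.

```python
import struct


def encode_ascii(value):
    """Assumed ascii-mode payload: the value as decimal text."""
    return str(value).encode("ascii")


def encode_binary(value):
    """Assumed binary-mode payload: one unsigned 32-bit big-endian word."""
    return struct.pack(">I", value)


def write_register(reg_path, value, binary=True):
    """Write value to the file at reg_path and return the bytes written.

    reg_path is a placeholder; on a real system it would point at the
    register file exposed for the running bof process.
    """
    payload = encode_binary(value) if binary else encode_ascii(value)
    # Unbuffered, so the payload goes out as a single write() call.
    with open(reg_path, "wb", buffering=0) as reg:
        reg.write(payload)
    return payload


if __name__ == "__main__":
    print(encode_ascii(1))   # prints b'1'
    print(encode_binary(1))  # prints b'\x00\x00\x00\x01'
```

Whichever mode the server uses, the register interface has to be set to the same mode, so it is worth checking both sides when the server code is inspected.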