Hi Jason, Thanks for the info. I switched from bootstrap configuration H to C, and now my boards seem stable.
I followed the instructions at the end of your post from November:

http://www.mail-archive.com/[email protected]/msg01058.html

Billy

Jason Manley wrote:
> 58C is a little high but should be ok. I've seen some PAPER boards
> start dying around 64C, though we've got some that will run happily
> at 70C. The BWRC lab could easily fluctuate by a few degrees as
> people wander in and out or additional equipment is turned on/off.
> So maybe you just have particularly sensitive boards.
>
> But more likely: Bad PPC memory is the primary reason ROACH boards
> sometimes die. Billy, please check that you're running boot config C
> with registered DIMMs in the PPC slot. There are plenty emails in the
> mail archive about this problem, or otherwise speak to Mark. If this
> is already the case, then you should try'n feed cooler air into the
> ROACH. At 20degC, with unblocked vents, you should be fine. Our lab
> runs over 25C on hot days and the boards run without problems, even
> without heatsinks on the PPC.
>
> It is true that the iStar cases have very poor cooling profiles.
> Phil did some very neat analysis while designing the "ROACH motel".
> There's a pic attached of ROACH boards in some concept cases where
> you can see that the hottest part is actually the QDR. Adding a fan
> blowing down onto the CPU actually heated-up some aspects of the
> board. The real problem is that you cannot get laminar flow over the
> PPC because the PPC DIMM creates turbulence. Just throwing more fans
> in the chassis creates mixed results, depending on where you place
> them. After these exercises, the ROACH motel chassis can run up to 10
> deg cooler than the iStar chassis (depending on how many fans you put
> in there). I've attached a pic of the latest incarnation. Another
> advantage is positive pressure, so you can put filters on the fan
> inlets without having to worry about dust seeping in through all the
> little mounting holes and joints.
>
> ROACH rack-mount boards are supposed to suck in air on the front, and
> blow hot air out the back. So placing chassis on top of each other
> or spacing them in the rack should have limited effect (radiation
> from top/bottom panel probably biggest difference). If you're seeing
> large changes in temp depending on spacing or rack position, then
> your inlet or exhausts probably aren't right. For example, check that
> there's not another device exhausting warm air into some of the
> inlets, or something blocking an outlet vent. The BEE2s, for example,
> sometimes had reversed airflow and if they're mounted nearby in a
> rack, could be blowing hot air into the roaches.
>
> The FPGA fan is not regulated - it runs directly off the power rail
> but the speed is monitored. The CPU doesn't even have a heatsink,
> much less a fan.
>
> Our conclusion here is that one needs to be careful about ventilation
> of the board. The iStar enclosures are poor for this purpose, but if
> used in well-controlled, airconditioned labs then it's ok. The
> primary reason for unreliable ROACH boards (hanging software or seg
> faults or occasional memory errors) is bad memory configuration -
> you want boot option C with registered DIMMs.
>
> Jason
>
>
> Here's a printout from a rackmount board in an iStar chassis in our
> lab from this morning. It is running reliably and is bordered on
> either side by another roach board. It does not have a heatsink on
> the PPC. To make matters worse, the lab's currently close to 30C
> ambient because our aircon's failed. This is about as hot as I would
> recommend you run.
>
> Current values:
> Channel          Current    Shutdown    Shutdown
> Name               value       below       above
> =====================================================================
> 1v5aux:             1.56        1.40        1.60
> Virtex5 temp:      39.50     -278.00       94.00
> 12V ATX:           11.77        9.98       13.95
> PPC temp:          65.25     -278.00       82.00
> 5V ATX:             5.00        4.38        5.60
> 3v3 ATX:            3.27        2.99        3.62
> 1V PS:              1.00        0.90        1.05
> 1V5 PS:             1.51        1.40        1.55
> 1V8 PS:             1.81        1.70        1.90
> 2V5 PS:             2.51        2.45        2.54
> Actl temp:         32.00     -278.00       70.00
> Fan 1:                 0 rpm
> Fan 2:              5820 rpm
> Fan 3:                 0 rpm
>
>
> Dan Werthimer wrote:
>> jason, dave, francois, cc billy, ed, walt
>>
>> billy is measuring roach PPC chip temperatures of 58C,
>> considerably hotter than the other chips on the roach board.
>>
>> see billy's temperature table in his email third email down
>> in chain below.
>>
>> billy is seeing flaky behavior on warm days,
>> i suspect the PPC is too hot.
>>
>> do you recommend we add heat sinks to the PPC's?
>>
>> is 60C too hot?
>> the data sheet says PPC440EPx can go to 100C, but my guess
>> is that the dram timing is marginal, and we need to cool
>> things down.
>>
>> do you have a recommended heatsink
>> (perhaps something with sticky tape on it?)
>>
>> should we recommend that all roach users install a heatsink?
>>
>> dan
>>
>> Billy Mallard wrote:
>>> Good idea. I'll try separating the enclosures, and maybe do
>>> another test with fans.
>>>
>>> The CPU/FPGA fan definitely has an RPM monitoring line, so i
>>> assume it's regulated. But i could be wrong.
>>>
>>> Billy
>>>
>>> Walt Fitelson wrote:
>>>> Nice tests, Billy. If you see it again, try 2 simple things:
>>>>
>>>> 1. Separate enclosure stack with blocks of wood--or even books
>>>> that do not block vent holes.
>>>>
>>>> 2. Open swinging front doors to get more vent holes.
>>>>
>>>> Then you might repeat tests, at least the first unloaded
>>>> measurement if you don't want to wait.
>>>>
>>>> A slightly more difficult test would be mount a fan temporarily
>>>> in the back to blow thru rear panel openings.
>>>>
>>>> BTW, do you happen to know if the cpu fan speed is regulated?
>>>> Maybe something is funny with that circuit or software.
>>>>
>>>> w.
>>>>
>>>> Billy Mallard wrote:
>>>>> Today, i can't seem to be able to get my boards to crash.
>>>>>
>>>>> I've been using the lm-sensor utilities in Linux to monitor
>>>>> temperatures of various parts of the boards. There are three
>>>>> thermistors: one on the FPGA, one on the PowerPC, and one out
>>>>> on the board somewhere.
>>>>>
>>>>> My enclosures are stacked directly on top of each other. That
>>>>> means there is no top ventilation for the bottom two
>>>>> enclosures. So, i'd expect the board in the bottom enclosure
>>>>> to be the hottest.
>>>>>
>>>>> I took temperatures at three stages: after booting with an
>>>>> unloaded FPGA, after loading the FPGA with my correlator
>>>>> bitstream, and after running my BRAM readout and UDP
>>>>> packetization / transmit code on the PowerPC. I let
>>>>> temperatures stabilize for ~10min at each stage before
>>>>> recording a reading. Here's what i saw:
>>>>>
>>>>> A = bottom (isi0)
>>>>> B = middle (isi2)
>>>>> C = top (isi1)
>>>>>
>>>>> Unloaded FPGA:           A        B        C
>>>>> Virtex-5 Core Temp:   +30.0C   +29.0C   +25.0C
>>>>> PowerPC Core Temp:    +55.0C   +55.0C   +55.0C
>>>>> Monitor Core Temp:    +29.0C   +32.0C   +29.0C
>>>>>
>>>>> Loaded FPGA:             A        B        C
>>>>> Virtex-5 Core Temp:   +44.0C   +43.0C   +39.0C
>>>>> PowerPC Core Temp:    +58.0C   +59.0C   +56.0C
>>>>> Monitor Core Temp:    +30.0C   +33.0C   +31.0C
>>>>>
>>>>> Packetizing PowerPC:     A        B        C
>>>>> Virtex-5 Core Temp:   +45.0C   +44.0C   +40.0C
>>>>> PowerPC Core Temp:    +59.0C   +60.0C   +57.0C
>>>>> Monitor Core Temp:    +31.0C   +34.0C   +32.0C
>>>>>
>>>>> Board temperatures look stable. PowerPC temperatures only
>>>>> ever increase by a ~2 degrees, and they never go above 60C.
>>>>> The FPGA temperature consistently rises by 15 degrees.
>>>>>
>>>>> Maybe it was warmer in the lab yesterday, when my boards kept
>>>>> becoming unresponsive? If i start noticing crashes again,
>>>>> i'll do another round of thermal readings. But for now,
>>>>> things appear to be working.
>>>>>
>>>>> Billy
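
For anyone comparing their own boards against the monitor printout quoted in this thread, the margin between each temperature reading and its "shutdown above" limit can be worked out with a few lines of Python. This is only a rough sketch: the names TEMPS_C, margin and closest_to_shutdown are made up for illustration, and the numbers are copied by hand from the quoted table rather than read from a board.

```python
# Temperature channels copied by hand from the quoted monitor printout:
# channel name -> (current value, "shutdown above" limit), in degrees C.
# The dictionary and function names are illustrative, not part of any ROACH tool.
TEMPS_C = {
    "Virtex5 temp": (39.50, 94.00),
    "PPC temp": (65.25, 82.00),
    "Actl temp": (32.00, 70.00),
}


def margin(channel, temps=TEMPS_C):
    """Degrees C left before the channel reaches its upper shutdown limit."""
    current, shutdown_above = temps[channel]
    return shutdown_above - current


def closest_to_shutdown(temps=TEMPS_C):
    """Name of the temperature channel with the smallest remaining margin."""
    return min(temps, key=lambda name: margin(name, temps))


if __name__ == "__main__":
    for name in TEMPS_C:
        print(f"{name}: {margin(name):.2f} C below shutdown")
    print("closest to shutdown:", closest_to_shutdown())
```

With the values from that printout, the PPC is the channel nearest its limit, which fits the thread's focus on the PowerPC rather than the FPGA.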

