Hi Jason,

Thanks for the info.  I switched from bootstrap configuration H to C,
and now my boards seem stable.

I followed the instructions at the end of your post from November:
http://www.mail-archive.com/[email protected]/msg01058.html

Billy

Jason Manley wrote:
> 58C is a little high but should be ok. I've seen some PAPER boards
> start dying around 64C, though we've got some that will run happily
> at 70C. The BWRC lab could easily fluctuate by a few degrees as
> people wander in and out or additional equipment is turned on/off.
> So maybe you just have particularly sensitive boards.
> 
> But more likely: Bad PPC memory is the primary reason ROACH boards 
> sometimes die. Billy, please check that you're running boot config C 
> with registered DIMMs in the PPC slot. There are plenty of emails in
> the mail archive about this problem, or otherwise speak to Mark. If
> this is already the case, then you should try to feed cooler air into
> the ROACH. At 20C, with unblocked vents, you should be fine. Our lab
> runs over 25C on hot days and the boards run without problems, even
> without heatsinks on the PPC.
> 
> It is true that the iStar cases have very poor cooling profiles.
> Phil did some very neat analysis while designing the "ROACH motel".
> There's a pic attached of ROACH boards in some concept cases where
> you can see that the hottest part is actually the QDR. Adding a fan
> blowing down onto the CPU actually heated up some parts of the
> board. The real problem is that you cannot get laminar flow over the
> PPC because the PPC DIMM creates turbulence. Just throwing more fans
> in the chassis creates mixed results, depending on where you place
> them. After these exercises, the ROACH motel chassis can run up to 10
> deg cooler than the iStar chassis (depending on how many fans you put
> in there). I've attached a pic of the latest incarnation. Another
> advantage is positive pressure, so you can put filters on the fan
> inlets without having to worry about dust seeping in through all the
> little mounting holes and joints.
> 
> ROACH rack-mount boards are supposed to suck in air on the front, and
> blow hot air out the back. So placing chassis on top of each other
> or spacing them in the rack should have limited effect (radiation
> from the top/bottom panels is probably the biggest difference). If
> you're seeing large changes in temp depending on spacing or rack
> position, then your inlets or exhausts probably aren't right. For
> example, check that
> there's not another device exhausting warm air into some of the
> inlets, or something blocking an outlet vent. The BEE2s, for example,
> sometimes had reversed airflow and if they're mounted nearby in a
> rack, could be blowing hot air into the roaches.
> 
> The FPGA fan is not regulated - it runs directly off the power rail
> but the speed is monitored. The CPU doesn't even have a heatsink,
> much less a fan.
> 
> Our conclusion here is that one needs to be careful about ventilation
> of the board. The iStar enclosures are poor for this purpose, but
> they're ok if used in well-controlled, air-conditioned labs. The
> primary reason for unreliable ROACH boards (hanging software or seg
> faults or occasional memory errors) is bad memory configuration -
> you want boot option C with registered DIMMs.
> 
> Jason
> 
> 
> Here's a printout from a rackmount board in an iStar chassis in our
> lab from this morning. It is running reliably and is bordered on
> either side by another roach board. It does not have a heatsink on
> the PPC. To make matters worse, the lab's currently close to 30C
> ambient because our aircon's failed. This is about as hot as I would
> recommend you run.
> 
> Current values:
>          Channel     Current    Shutdown    Shutdown
>             Name        value       below       above
> =====================================================================
>          1v5aux:        1.56        1.40        1.60
>    Virtex5 temp:       39.50     -278.00       94.00
>         12V ATX:       11.77        9.98       13.95
>        PPC temp:       65.25     -278.00       82.00
>          5V ATX:        5.00        4.38        5.60
>         3v3 ATX:        3.27        2.99        3.62
>           1V PS:        1.00        0.90        1.05
>          1V5 PS:        1.51        1.40        1.55
>          1V8 PS:        1.81        1.70        1.90
>          2V5 PS:        2.51        2.45        2.54
>       Actl temp:       32.00     -278.00       70.00
>           Fan 1:         0 rpm
>           Fan 2:      5820 rpm
>           Fan 3:         0 rpm
> 
> 
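To make the margins in the printout above concrete, here is a minimal Python sketch that computes how far each channel sits from its nearer shutdown limit. The values are transcribed from the printout; the names `channels`, `headroom`, and `within_limits` are hypothetical and not part of any ROACH tool.

```python
# Values transcribed from the printout above:
# (current value, shutdown below, shutdown above).
# Temperatures are in degrees C, rails in volts.
channels = {
    "1v5aux": (1.56, 1.40, 1.60),
    "Virtex5 temp": (39.50, -278.00, 94.00),
    "12V ATX": (11.77, 9.98, 13.95),
    "PPC temp": (65.25, -278.00, 82.00),
    "5V ATX": (5.00, 4.38, 5.60),
    "3v3 ATX": (3.27, 2.99, 3.62),
    "1V PS": (1.00, 0.90, 1.05),
    "1V5 PS": (1.51, 1.40, 1.55),
    "1V8 PS": (1.81, 1.70, 1.90),
    "2V5 PS": (2.51, 2.45, 2.54),
    "Actl temp": (32.00, -278.00, 70.00),
}


def headroom(name):
    """Distance from the current value to the nearer shutdown limit."""
    value, low, high = channels[name]
    return round(min(value - low, high - value), 2)


def within_limits(name):
    """True if the current value lies strictly between both limits."""
    value, low, high = channels[name]
    return low < value < high


for name in channels:
    print(f"{name:>14}: headroom {headroom(name):6.2f}  ok={within_limits(name)}")
```

With these numbers the PPC sits 16.75 degrees below its 82C shutdown limit, the Virtex5 54.5 degrees below its limit, and every channel is inside its window.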
> Dan Werthimer wrote:
>> jason, dave, francois,   cc billy, ed, walt
>>
>> billy is measuring roach PPC chip temperatures of 58C,
>> considerably hotter than the other chips on the roach board.
>>
>> see billy's temperature table in his email, third down
>> in the chain below.
>>
>> billy is seeing flaky behavior on warm days,
>> i suspect the PPC is too hot.
>>
>> do you recommend we add heat sinks to the PPC's?
>>
>> is 60C too hot?
>> the data sheet says PPC440EPx can go to 100C, but my guess
>> is that the dram timing is marginal, and we need to cool
>> things down.
>>
>> do you have a recommended heatsink
>> (perhaps something with sticky tape on it?)
>>
>> should we recommend that all roach users install a heatsink?
>>
>> dan
>>
>> Billy Mallard wrote:
>>> Good idea. I'll try separating the enclosures, and maybe do
>>> another test with fans.
>>> 
>>> The CPU/FPGA fan definitely has an RPM monitoring line, so i
>>> assume it's regulated. But i could be wrong.
>>>
>>> Billy
>>>
>>> Walt Fitelson wrote:
>>>> Nice tests, Billy. If you see it again, try 2 simple things:
>>>> 
>>>> 1. Separate enclosure stack with blocks of wood--or even books
>>>> that do not block vent holes.
>>>> 
>>>> 2. Open swinging front doors to get more vent holes.
>>>> 
>>>> Then you might repeat tests, at least the first unloaded
>>>> measurement if you don't want to wait.
>>>> 
>>>> A slightly more difficult test would be to mount a fan
>>>> temporarily in the back to blow thru rear panel openings.
>>>> 
>>>> BTW, do you happen to know if the cpu fan speed is regulated?
>>>> Maybe something is funny with that circuit or software.
>>>>
>>>> w.
>>>>
>>>> Billy Mallard wrote:
>>>>> Today, i can't seem to get my boards to crash.
>>>>> 
>>>>> I've been using the lm-sensors utilities in Linux to monitor
>>>>> temperatures of various parts of the boards. There are three
>>>>> thermistors: one on the FPGA, one on the PowerPC, and one out
>>>>> on the board somewhere.
>>>>> 
>>>>> My enclosures are stacked directly on top of each other. That
>>>>> means there is no top ventilation for the bottom two
>>>>> enclosures. So, i'd expect the board in the bottom enclosure
>>>>> to be the hottest.
>>>>> 
>>>>> I took temperatures at three stages: after booting with an
>>>>> unloaded FPGA, after loading the FPGA with my correlator
>>>>> bitstream, and after running my BRAM readout and UDP
>>>>> packetization / transmit code on the PowerPC. I let
>>>>> temperatures stabilize for ~10min at each stage before 
>>>>> recording a reading.  Here's what i saw:
>>>>>
>>>>> A = bottom (isi0)
>>>>> B = middle (isi2)
>>>>> C = top (isi1)
>>>>>
>>>>> Unloaded FPGA:           A       B       C
>>>>> Virtex-5 Core Temp:   +30.0C  +29.0C  +25.0C
>>>>> PowerPC Core Temp:    +55.0C  +55.0C  +55.0C
>>>>> Monitor Core Temp:    +29.0C  +32.0C  +29.0C
>>>>>
>>>>> Loaded FPGA:             A       B       C
>>>>> Virtex-5 Core Temp:   +44.0C  +43.0C  +39.0C
>>>>> PowerPC Core Temp:    +58.0C  +59.0C  +56.0C
>>>>> Monitor Core Temp:    +30.0C  +33.0C  +31.0C
>>>>>
>>>>> Packetizing PowerPC:     A       B       C
>>>>> Virtex-5 Core Temp:   +45.0C  +44.0C  +40.0C
>>>>> PowerPC Core Temp:    +59.0C  +60.0C  +57.0C
>>>>> Monitor Core Temp:    +31.0C  +34.0C  +32.0C
>>>>>
>>>>> Board temperatures look stable. PowerPC temperatures only
>>>>> ever increase by 2 to 5 degrees, and they never go above 60C.
>>>>> The FPGA temperature consistently rises by 15 degrees.
>>>>> 
>>>>> Maybe it was warmer in the lab yesterday, when my boards kept
>>>>> becoming unresponsive? If i start noticing crashes again,
>>>>> i'll do another round of thermal readings. But for now,
>>>>> things appear to be working.
>>>>>
>>>>> Billy
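
The per-stage tables in Billy's message can be checked with a short Python sketch that computes the total rise from the unloaded stage to the packetizing stage for each enclosure. The readings are transcribed from the tables; the names `readings` and `total_rise` are hypothetical, chosen only for this illustration.

```python
# Readings transcribed from the tables above, in degrees C.
# Each tuple is ordered by enclosure: A (bottom), B (middle), C (top).
readings = {
    "Virtex-5": {
        "unloaded": (30.0, 29.0, 25.0),
        "loaded": (44.0, 43.0, 39.0),
        "packetizing": (45.0, 44.0, 40.0),
    },
    "PowerPC": {
        "unloaded": (55.0, 55.0, 55.0),
        "loaded": (58.0, 59.0, 56.0),
        "packetizing": (59.0, 60.0, 57.0),
    },
    "Monitor": {
        "unloaded": (29.0, 32.0, 29.0),
        "loaded": (30.0, 33.0, 31.0),
        "packetizing": (31.0, 34.0, 32.0),
    },
}


def total_rise(sensor):
    """Rise from the unloaded stage to the packetizing stage, per enclosure."""
    stages = readings[sensor]
    return tuple(
        end - start
        for start, end in zip(stages["unloaded"], stages["packetizing"])
    )


for sensor in readings:
    print(sensor, total_rise(sensor))
```

For these readings the Virtex-5 rises by 15 degrees in all three enclosures, the PowerPC by 4, 5, and 2 degrees (A, B, C), and the monitor sensor by 2, 2, and 3 degrees.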
