Hi, Jonathan,

The ROACH2s at GB output over all 8 SFP+ ports very often without problem.  Not 
sure whether this matters, but they are connected via fiber optic transceivers 
rather than copper cables.

HTH,
Dave

> On Apr 19, 2018, at 11:00, Jonathan Weintroub <[email protected]> 
> wrote:
> 
> Dear kind CASPER Colleagues,
> 
> To offer a little more feedback on this:
> 
> —We reiterate that all advice is appreciated and useful.  It may well be 
> relevant to prior weird experiences; however, in the current case . . .
> 
> — . . . after assorted power cycles removing all inputs, confirmation that 
> the unit has an approved FSP power supply, and swapping in spares both at the 
> LRU and NIC level, we are now convinced that our current issues with one 
> 10GigE port of 8 going down are not ROACH2 hardware related, but rather 
> something to do with the environment in which it is installed (i.e. related 
> to external stimuli).  Still investigating.
> 
> —One unusual aspect of this application is we are using all 8 SFP+ ports on 
> the ROACH2, though we are not stressing the rates. It is a long shot, but are 
> there any insights into possible stresses or snafus we might run into when 
> fully utilizing the ROACH 10GigE NIC ports?
> 
> Thanks again.
> 
> Jonathan & crew
> 
> 
> 
> 
>> On Apr 18, 2018, at 10:13 AM, Jonathan Weintroub <[email protected]> wrote:
>> 
>> Hi Jonathon,
>> 
>> Your important input here warrants cc to the mailing list, hereby 
>> accomplished.
>> 
>> We have switched to the FSP power supplies for new builds, and have repaired 
>> older ROACH2s, a number of which have had failing XEALs (mostly), by replacing 
>> those with FSPs.  We have, I think, done some prophylactic FSP replacements in 
>> offline spare stock.  But we’ve ordered and deployed probably over 100 
>> ROACH2s over a period of about four, perhaps even five, years; they are used at 
>> SMA for SWARM, and also distributed all over the world for the EHT.  So we 
>> have NOT retrofitted every unit out there with FSP power supplies.
>> 
>> While the XEALs are known to be unreliable, when a unit is working it's 
>> not that straightforward to recall it for a power supply replacement: “if it 
>> ain’t broke, don’t fix it” applies.
>> 
>> Thanks for your input. Thanks also for input from Dan, Jason, Matt and Mike, 
>> which is valuable and relevant advice.  I was holding off on responding, 
>> we’re at SMA running tests, and don’t yet know the resolution for the units 
>> in question.  
>> 
>> Jonathon’s email triggered this interim response. I’ll let all know the 
>> outcome on the lightning damage when we have one.
>> 
>> Thanks,
>> 
>> Jonathan
>> 
>> 
>> 
>>> On Apr 18, 2018, at 9:44 AM, Jonathon Kocz <[email protected]> wrote:
>>> 
>>> Hi Jonathan,
>>> 
>>> I think you've already addressed this, but to double check, are these R2s 
>>> after you switched to the SP25-60FAG power supply?
>>> 
>>> I've had a lot of trouble with R2s using istar/xeal supplies getting into 
>>> strange situations that always seem fixable with a new power supply. 
>>> 
>>> Cheers,
>>> Jonathon
>>> 
>>> On 17 April 2018 at 16:22, Jonathan Weintroub <[email protected]> wrote:
>>> Hi CASPERites,
>>> 
>>> With experience on quite a few ROACH2s in the lab and in the field over some 
>>> years, a pattern has emerged which warrants a question to the ROACH2 
>>> experts on this list. The SAO team has seen strange faults happen on 
>>> multiple ROACH2 units after power failures, dips and lightning storms.   
>>> I’ll list the various weirdnesses below, but the key point is that a full 
>>> power cycle, including removing power from the line input, does not reset 
>>> and cure the units, whereas an extended power down (like overnight, or 24 
>>> hours, or more) does seem to bring the units back to life again.  This was 
>>> discovered serendipitously, and has happened often enough that the pattern 
>>> seems repeatable (though controlled experiments aren’t really possible; we 
>>> try not to stress our equipment this way).
>>> 
>>> Has anyone else seen this, and does someone perhaps have a suggestion as to 
>>> root cause, or some way to accelerate the reset?
>>> 
>>> Example faults have included:
>>> 
>>> —ADC5G clock not being correctly received, or not being transmitted to 
>>> FPGA, or being transmitted at incorrect speed.
>>> 
>>> —A particular ADC would refuse to calibrate its digital interface to the 
>>> FPGA.
>>> 
>>> —QDRs which don’t calibrate
>>> 
>>> —After a lightning storm on Maunakea we have two units with a single SFP+ 
>>> port among 8 failing to transmit packets, though we have yet to see if an 
>>> extended power down will cure this.
>>> 
>>> Again these faults have been distributed across multiple units, and in all 
>>> cases have eventually been cleared, after extended power down.  Which is 
>>> good, but the pathology worries us.
>>> 
>>> Thanks in advance for any light that might be cast on this issue.
>>> 
>>> Jonathan and André
>>> EHT/SMA
>>> 
>>> -- 
>>> You received this message because you are subscribed to the Google Groups 
>>> "[email protected] <mailto:[email protected]>" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an 
>>> email to [email protected] 
>>> <mailto:casper%[email protected]>.
>>> To post to this group, send email to [email protected] 
>>> <mailto:[email protected]>.
>>> 
>> 
> 
> 

