You just caused me to have a flashback.  I told the code writer to 
never-ever-ever use GOTO in C code.  He did anyhow and it caused a stack 
overflow after so many operations.  Had to send him to Japan with a suitcase of 
one-time-programmable MCUs to change out thousands of units in the field.
Arrgh.  That’s $25K I will never see again.

From: Aaron Schneider 
Sent: Thursday, January 21, 2016 6:04 PM
To: [email protected] 
Subject: Re: [AFMUG] Cambium 450 Watchdog resets - was: To Cambium With Love- 
Replace the bad ePMP units.

The issue there is 1) access to all of this data, and 2) being able to act on 
that data.

 

It seems like it would be helpful to have a flood of data like this, but the 
nature of the problem is that once the memory controller goes bonkers, we can’t 
even rely on code to function properly.  This is why it goes to a Watchdog 
Reset.  What most often happens in this issue is that we get a bad opcode and 
go to the exception vector.  While the exception handler is dealing with the 
bad opcode, we get *another* bad opcode and cycle back to the exception 
vector.  Now the second bad opcode is entrenched in the instruction cache, and 
the cycle repeats until the watchdog resets the unit.

 

So even if we could collect data, we couldn’t trust the code that would log 
that data any more than we could trust the normal running code.

 

The key to these types of issues is local reproducibility so we can iterate 
very quickly and often as we learn new things about the problem in each 
iteration.   In order to do that, we need direct, within 6 feet, access to the 
physical hardware.  So at one point, we were looking to go out to customers 
that could see this on a bench and not on a tower and debug in place.  Before 
that happened, we finally got a break in reproducing it in our lab repeatedly.

 

 

 

From: Af [mailto:[email protected]] On Behalf Of Josh Luthman
Sent: Thursday, January 21, 2016 6:59 PM
To: [email protected]
Subject: Re: [AFMUG] Cambium 450 Watchdog resets - was: To Cambium With Love- 
Replace the bad ePMP units.

 

Would it be helpful to have a test or memory dump load for the APs it's 
happening consistently on?  Rather than reproducing it in the lab, just use 
real repeating units.

Josh Luthman
Office: 937-552-2340
Direct: 937-552-2343
1100 Wayne St
Suite 1337
Troy, OH 45373

On Jan 21, 2016 7:50 PM, "Aaron Schneider" 
<[email protected]> wrote:

  Hi Everyone –

   

  Sorry for the delay in response on this thread.  I’d like to give an update 
of where we are with this issue.

   

  First off, I would like to apologize for the issues that this is causing.  
We have heard reports for a while in varying fashion, and Tushar had been 
talking about having things like this for quite some time, but we were having 
trouble finding any correlation between reports (configuration, network 
topology, etc.) and were unable to recreate the issue in our lab on 
demand.  This issue appears to have definitely gotten worse in the 13.4 release 
and is becoming more widespread as the weather turns.

   

  What we have found out in the last several weeks is that there is an issue 
with the memory controller code in the FPGA.  This leads to a loss of memory 
coherency, which has now been verified to cause several issues.  We had seen 
reports of various resets over time but had no reason to correlate them to one 
root cause until now.  The most prevalent of these is the Watchdog Reset 
without any accompanying crash log.  The other issues with the same root cause 
are the Illegal Instruction crash, the Invalid NiBuf crash, as well as any Null 
Exception Handler crash.  The bottom line is that when memory contents glitch 
underneath your software, the outcome depends on when it happens.  We have 
found this to be very reproducible at very cold temperatures (-20C to -50C), 
but it has been seen and reported at higher temperatures, just not as often.

   

  The nature of the FPGA based memory controller is that there can be timing 
issues that get exacerbated at extreme temperatures.  If you don’t have proper 
constraints in place for a given signal path, its timing characteristics can 
change on you as temperature changes.  Also, if you don’t have a proper 
constraint in place, even recompiling the FPGA can change those characteristics 
so that what used to work fine becomes susceptible to extremes.  Something 
happened with the 13.4 FPGA that brought this to the edge, so that it is now a 
problem and, as we are seeing with winter cold coming in, is becoming much more 
prevalent at cold temperatures.  13.4 and 13.4.1 have the same FPGA.  14.1.2 
has a new FPGA and there have been some improvements made in this area, but we 
have found it is still susceptible to the problem.

   

  We are reproducing the problem in our lab and we have multiple developers 
digging in to figure out what is going on.  These types of issues with timing 
are generally very difficult to find and fix, but this is our highest priority 
right now and we will not have another release until this is fixed.  

   

  I’ve talked mostly about 13.4 and 13.4.1 here, but the nature of this issue 
and how it can interact with hardware doesn’t preclude it from having been the 
cause of the issues some (like Tushar) have seen over time.  Once we have a fix 
for this, we will be adding more rigorous regression testing including an 
internal HW memory test to validate that this type of memory issue doesn’t come 
back again.
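
For readers wondering what a memory test of this kind can look like, below is a minimal sketch in C of two common checks: a walking-ones test on a single word and an index-derived pattern test over a region. It is an illustration only. The function names and the 0xA5A5A5A5 seed are assumptions made for this example, and nothing here describes the test Cambium actually plans to add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walking-ones test on one location: write each single-bit pattern and read
 * it back. Returns 0 on pass, otherwise the first pattern that failed. */
uint32_t mem_test_data_bus(volatile uint32_t *addr)
{
    for (uint32_t pattern = 1u; pattern != 0u; pattern <<= 1) {
        *addr = pattern;
        if (*addr != pattern) {
            return pattern;
        }
    }
    return 0u;
}

/* Value stored in word i: unique per index so aliased or stuck addresses
 * show up as mismatches. The seed is arbitrary (hypothetical choice). */
static uint32_t pattern_for(size_t i)
{
    return (uint32_t)i ^ 0xA5A5A5A5u;
}

/* Fill, verify, then repeat with the inverted pattern so every bit is
 * checked both as 0 and as 1. Returns the index of the first failing word,
 * or nwords if the whole region passed. */
size_t mem_test_region(volatile uint32_t *base, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        base[i] = pattern_for(i);
    }
    for (size_t i = 0; i < nwords; i++) {
        if (base[i] != pattern_for(i)) {
            return i;
        }
        base[i] = (uint32_t)~pattern_for(i);
    }
    for (size_t i = 0; i < nwords; i++) {
        if (base[i] != (uint32_t)~pattern_for(i)) {
            return i;
        }
    }
    return nwords;
}
```

On healthy memory, mem_test_data_bus() returns 0 and mem_test_region(buf, n) returns n. A regression run for a temperature-dependent fault like this one would presumably loop such a test while the unit sits in a cold chamber, since a single pass at room temperature says little.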

   

  From what we’ve seen and heard, this issue only affects the 450 AP FPGA and 
is not an issue on the 450 SM, the 430 AP/SM, or the 450i devices.  The 450i is 
a very different architecture and has a hardware based memory controller and 
watchdog timer, whereas on the 450/430 based devices these items are in the 
FPGA.

   

   

  Again, I apologize for the severe inconvenience and realize that it is 
getting colder and colder in NA so we are racing against the clock with this.   
As soon as we have any updates and new open beta loads with a fix, I’ll let you 
know.

   

  I appreciate your patience.

   

  Regards,

  -Aaron

   

  From: Af [mailto:[email protected]] On Behalf Of Brian Sullivan
  Sent: Thursday, January 21, 2016 4:11 PM
  To: [email protected]
  Subject: Re: [AFMUG] Cambium 450 Watchdog resets - was: To Cambium With Love- 
Replace the bad ePMP units.

   

  I was assured today that the issue isn't the hardware. Evidently this 
issue can be solved with an upcoming software upgrade.
  Time will tell.
  
http://community.cambiumnetworks.com/t5/PMP-450/13-2-to-13-4-System-Reset-Exception-Watchdog-Reset/td-p/43347/page/2

  On 1/21/2016 4:02 PM, Joe Falaschi wrote:

    We have some APs that have uptime over 60 days but many reboot every 1-3 
weeks. This is definitely an outlier. We've been in contact with Cambium 
on this via an open ticket and sending them all of the information they request 
and nobody has said oh gosh that is bad hardware RMA it. So, we're just 
going around and around. We'll end up just replacing it and hoping they will 
take it back because obviously this is bad. We are running 14.x per their 
request. We saw this on 13.x as well.

     

    Joe

     

     

    On Jan 21, 2016, at 12:05 PM, Ken Hohhof wrote:

     

      Joe, that is seriously bad. I see watchdog resets and a few stack 
dumps, but uptime on 450 APs is typically 2-4 weeks, despite the recent cold 
weather, in fact I don't think it has been more common than it was last 
summer. I have not gone to 14.x though, everything is still on 13.2.

      So either you have a bad unit, or 14.x is making it much worse. If 
everyone was seeing resets every few minutes or hours, I think there would be 
villagers with torches and pitchforks outside Cambium HQ.

      Brian from FVI does have a thread on the Cambium Community about this.

      FWIW, I have one 450i 900 MHz which necessarily is on 14.1, and it does 
not appear to be having watchdog resets. Lightly loaded however, just 2 subs.

      From: Joe Falaschi 

      Sent: Thursday, January 21, 2016 11:34 AM

      To: [email protected] 

      Subject: Re: [AFMUG] Cambium 450 Watchdog resets - was: To Cambium With 
Love- Replace the bad ePMP units.

      We see a ton of reboots on the 450 platform as well. It's getting 
pretty frustrating simply because this is such a long term issue. One of my 
APs has rebooted 195 times (now running 14.1.2). They are saying we should 
replace the AP but it is unclear if we can RMA it or not. We do have an open 
ticket.

      Joe Falaschi

      e-vergent

      <Screen Shot 2016-01-21 at 11.30.16 AM.png>

      On Jan 20, 2016, at 9:26 PM, Mark Radabaugh wrote:

       

        Hum... sounds very similar. It's temperature sensitive as 
well - gets far worse with low temperatures, and we are having pretty cold 
temps this week.

        Extremely frustrating and causing real customer complaints.

        Mark

          On Jan 20, 2016, at 9:28 PM, Tushar Patel <[email protected]> wrote:

          For over two years we have been seeing random reboots. We were told 
over and over again that we were the only ones. Then a few people started 
reporting it.

          But Cambium never could get to the bottom of the problem in those two 
years, so I gave up on Cambium fixing this random reboot. We stopped calling 
them about it.

          As new versions of the software have come out over the two years, we 
have seen the frequency of the problem drop, but it has not gone away.

          Tushar


          On Jan 20, 2016, at 6:25 PM, Mark Radabaugh <[email protected]> wrote:

            Tushar,

            What did you give up on? Or do?

            Please note the mailing and shipping address change below:

            Mark Radabaugh
            Amplex
            22690 Pemberville Rd
            Luckey, OH 43443
            419-837-5015 x1021
            [email protected]

              On Jan 20, 2016, at 4:49 PM, Tushar Patel <[email protected]> wrote:

              That's what they used to tell us too. We have given up on the 
subject now.

              Tushar


              On Jan 20, 2016, at 1:09 PM, Mark Radabaugh <[email protected]> 
wrote:

                Wait - they keep telling us we are the only ones that this 
happens to with 450?

                So who else is having reboot-o-rama with 450s?

                Mark

                  On Jan 20, 2016, at 1:20 PM, Brian Sullivan 
<[email protected]> wrote:

                  I wish they would fix/replace the bad 450 APs that suffer 
from Watchdog Resets.
                  Although replacing 100 450 APs is cheaper than ePMP. :-/

                  On 1/20/2016 12:11 PM, Josh Luthman wrote:

                    Why would making the memory faster degrade performance?

                    Josh Luthman
                    Office: 937-552-2340
                    Direct: 937-552-2343
                    1100 Wayne St
                    Suite 1337
                    Troy, OH 45373

                    On Wed, Jan 20, 2016 at 1:00 PM, Tyson Burris @ Internet 
Communications Inc <[email protected]> wrote:

                      Hello Cambium,

                      At the MidWest-IX launch party last night, several of us 
Indiana WISPs compared notes on the "cold weather" problems we are seeing 
with ePMPs. It was very interesting to learn we are experiencing identical 
problems across the spectrum.

                      We all understand this is a DRAM issue with certain units 
you have identified. We also understand that a firmware RC has been made 
available to fix this in the short term.

                      The bottom line is we are very frustrated and growing 
tired of dealing with it.

                      Our concern is simple. If your software fix 
"degrades" the performance of the product or triggers other issues, as it 
has been suggested, we would prefer a full recall and replacement program 
immediately.

                      If the suggestion that the fix will degrade the product 
performance is inaccurate, and the fix will not cause other issues, I would 
like this to be made public.

                      Thank you,

                      Tyson Burris, President
                      Internet Communications Inc.
                      739 Commerce Dr.
                      Franklin, IN 46131

                      317-738-0320 Daytime #
                      317-412-1540 Cell/Direct #
                      Online: www.surfici.net

                      <Mail Attachment.png>

                      What can ICI do for you?

                      Broadband Wireless - PtP/PtMP Solutions - WiMax - Mesh 
Wifi/Hotzones - IP Security - Fiber - Tower - Infrastructure.
                      CONFIDENTIALITY NOTICE: This e-mail is intended for the 
                      addressee shown. It contains information that is 
                      confidential and protected from disclosure. Any review, 
                      dissemination or use of this transmission or its contents 
                      by unauthorized organizations or individuals is strictly 
                      prohibited.

     

   
