When in doubt, try to kill it with fire... or ice.

I wonder if this is related. I've had a couple 3.6 clusters start randomly dropping sessions the past week or so while it's been cold. Most SMs can't re-register. SMs with HP definitely cannot re-register and say the HP VC was stuck and cleared a few times in their logs. APs are rebooted and all is clear. Mostly at night. I figured it was some traffic overload condition, until it happened at 4am where the traffic is at minimum.

Then similar things happened back during the summer, except they kept failing to register due to "out of range" in the reg fail list. Again, have to reboot the APs to fix it.

And in both cases, an LBT hit seems to trigger this. However, I've seen the same thing happen on 5.7 sectors where there is obviously no LBT.

Or how about sync, no sync, sync, no sync, sync, no sync until the AP is rebooted.

Can bad stuff in memory do all kinds of weird shit just like this? I hope this is the root of all this, because I'm out of things to try.. and sanity.

On 1/21/2016 6:50 PM, Aaron Schneider wrote:

Hi Everyone,

Sorry for the delay in response on this thread. I'd like to give an update of where we are with this issue.

First off, I would like to apologize for the issues that this is causing. We have heard reports for a while in varying fashion, and Tushar had been talking about having things like this for quite some time, but we were having issues finding some correlation between reports (configuration, network topology, etc.), as well as being unable to recreate the issue in our lab on demand. This issue appears to have definitely gotten worse in the 13.4 release and is becoming more widespread as the weather turns.

What we have found out in the last several weeks is that there is an issue with the memory controller code in the FPGA. What this leads to is memory coherency being lost which actually has now been verified to lead to several issues. We had seen reports of various resets over time but had no reason to correlate them to one root cause until now. The most prevalent of these is the Watchdog Reset without any accompanying crash log. The other issues with the same root cause are the Illegal Instruction crash, the Invalid NiBuf crash, as well as any Null Exception Handler crash. The bottom line is, when memory contents glitch on your software, it depends on when it happens as to what the outcome is. We have found this to be very reproducible at very cold temperatures (-20C -- -50C), but it has been seen and reported at higher temperatures, just not as often.

The nature of the FPGA based memory controller is that there can be timing issues that get exacerbated at extreme temperatures. If you don't have proper constraints in place for a given signal path, its timing characteristics can change on you as temperature changes. Also, if you don't have a proper constraint in place, even recompiling the FPGA can change the characteristics that then make what used to work fine susceptible to extremes. Something happened with the 13.4 FPGA that brought this to the edge such that it is now a problem and as we are seeing with winter cold coming in, becoming much more prevalent at cold temperatures. 13.4 and 13.4.1 have the same FPGA. 14.1.2 has a new FPGA and there have been some improvements made in this area, but we have found it is still susceptible to the problem.

We are reproducing the problem in our lab and we have multiple developers digging in to figure out what is going on. These types of issues with timing are generally very difficult to find and fix, but this is our highest priority right now and we will not have another release until this is fixed.

I've talked mostly about 13.4 and 13.4.1 here, but the nature of this issue and how it can interact with hardware doesn't preclude it from having been the cause of the issues some (like Tushar) have seen over time. Once we have a fix for this, we will be adding more rigorous regression testing including an internal HW memory test to validate that this type of memory issue doesn't come back again.

From what we've seen and heard, this issue only affects the 450 AP FPGA and is not an issue on the 450 SM, 430AP/SM, nor the 450i devices. The 450i is a very different architecture and has a hardware based memory controller and watchdog timer whereas on the 450/430 based devices, these items are in the FPGA.

Again, I apologize for the severe inconvenience and realize that it is getting colder and colder in NA so we are racing against the clock with this. As soon as we have any updates and new open beta loads with a fix, I'll let you know.

I appreciate your patience.

Regards,

-Aaron

*From:*Af [mailto:[email protected]] *On Behalf Of *Brian Sullivan
*Sent:* Thursday, January 21, 2016 4:11 PM
*To:* [email protected]
*Subject:* Re: [AFMUG] Cambium 450 Watchdog resets - was: To Cambium With Love- Replace the bad ePMP units.

I was assured today that the issue isn't the hardware. Evidently this issue can be solved with an upcoming software upgrade.
Time will tell.
http://community.cambiumnetworks.com/t5/PMP-450/13-2-to-13-4-System-Reset-Exception-Watchdog-Reset/td-p/43347/page/2

On 1/21/2016 4:02 PM, Joe Falaschi wrote:

    We have some APs that have uptime over 60 days but many reboot
    every 1-3 weeks. This is definitely an outlier. We've been
    in contact with Cambium on this via an open ticket and sending
    them all of the information they request and nobody has said oh
    gosh that is bad hardware RMA it. So, we're just going around
    and around. We'll end up just replacing it and hoping they will
    take it back because obviously this is bad. We are running 14.x
    per their request. We saw this on 13.x as well.

    Joe

    On Jan 21, 2016, at 12:05 PM, Ken Hohhof wrote:



        Joe, that is seriously bad. I see watchdog resets and a few
        stack dumps, but uptime on 450 APs is typically 2-4 weeks,
        despite the recent cold weather, in fact I don't think it
        has been more common than it was last summer. I have not
        gone to 14.x though, everything is still on 13.2.

        So either you have a bad unit, or 14.x is making it much
        worse. If everyone was seeing resets every few minutes or
        hours, I think there would be villagers with torches and
        pitchforks outside Cambium HQ.

        Brian from FVI does have a thread on the Cambium Community
        about this.

        FWIW, I have one 450i 900 MHz which necessarily is on 14.1,
        and it does not appear to be having watchdog resets.
        Lightly loaded however, just 2 subs.

        *From:*Joe Falaschi <mailto:[email protected]>

        *Sent:*Thursday, January 21, 2016 11:34 AM

        *To:*[email protected] <mailto:[email protected]>

        *Subject:*Re: [AFMUG] Cambium 450 Watchdog resets - was: To
        Cambium With Love- Replace the bad ePMP units.

        We see a ton of reboots on the 450 platform as well. It's
        getting pretty frustrating simply because this is such a long
        term issue. One of my APs has rebooted 195 times (now
        running 14.1.2). They are saying we should replace the AP
        but it is unclear if we can RMA it or not. We do have an
        open ticket.

        Joe Falaschi

        e-vergent

        <Screen Shot 2016-01-21 at 11.30.16 AM.png>

        On Jan 20, 2016, at 9:26 PM, Mark Radabaugh wrote:



            Hum... sounds very similar. It's temperature
            sensitive as well - gets far worse with low temperatures,
            and we are having pretty cold temps this week.

            Extremely frustrating and causing real customer complaints.

            Mark

                On Jan 20, 2016, at 9:28 PM, Tushar Patel
                <[email protected] <mailto:[email protected]>> wrote:

                Over two years we have been seeing random reboots. We
                were told over and over again you are the only one.
                Then a few people started reporting.

                But Cambium never could get to the bottom of the
                problems for two years, so I gave up on Cambium fixing
                this random reboot. We stopped calling them about it.

                As new versions of the software have come out over
                two years we have seen the frequency of the problem
                reduce but not go away.

                Tushar

                On Jan 20, 2016, at 6:25 PM, Mark Radabaugh
                <[email protected] <mailto:[email protected]>> wrote:

                    Tushar,

                    What did you give up on? Or do?

                    Please note the mailing and shipping address
                    change below:

                    Mark Radabaugh
                    Amplex
                    22690 Pemberville Rd

                    Luckey, OH 43443
                    419-837-5015 x1021
                    [email protected] <mailto:[email protected]>

                        On Jan 20, 2016, at 4:49 PM, Tushar Patel
                        <[email protected] <mailto:[email protected]>> wrote:

                        That's what they used to tell us too. We
                        have given up on the subject now.

                        Tushar

                        On Jan 20, 2016, at 1:09 PM, Mark Radabaugh
                        <[email protected] <mailto:[email protected]>> wrote:

                            Wait - they keep telling us we are the
                            only ones that this happens to with 450?

                            So who else is having reboot-o-rama with
                            450's?

                            Mark

                                On Jan 20, 2016, at 1:20 PM, Brian
                                Sullivan <[email protected]
                                <mailto:[email protected]>> wrote:

                                I wish they would fix/replace the bad
                                450 AP's that suffer from Watchdog
                                Resets.
                                Although replacing 100 450 AP's is
                                cheaper than ePMP. :-/

                                On 1/20/2016 12:11 PM, Josh Luthman wrote:

                                    Why would making the memory faster
                                    degrade performance?

                                    Josh Luthman
                                    Office: 937-552-2340
                                    Direct: 937-552-2343
                                    1100 Wayne St
                                    Suite 1337
                                    Troy, OH 45373

                                    On Wed, Jan 20, 2016 at 1:00 PM,
                                    Tyson Burris @ Internet
                                    Communications Inc
                                    <[email protected]
                                    <mailto:[email protected]>> wrote:

                                        Hello Cambium,


                                        At the MidWest-IX launch party
                                        last night, several of us
                                        Indiana WISPs compared notes
                                        on the "cold weather"
                                        problems we are seeing with
                                        ePMPs. It was very
                                        interesting to learn we are
                                        experiencing identical problems
                                        across the spectrum.

                                        We all understand this is a
                                        DRAM issue with certain units
                                        you have identified. We
                                        also understand the firmware
                                        RC that has been made
                                        available to fix this short term.

                                        The bottom line is we are very
                                        frustrated and grow tired of
                                        dealing with it.


                                        Our concern is simple. If
                                        your software fix
                                        "degrades" the performance
                                        of the product or triggers
                                        other issues, as it has been
                                        suggested, we would prefer a
                                        full recall and replacement
                                        program immediately.


                                        If the suggestion that the fix
                                        will degrade the product
                                        performance is inaccurate and
                                        the fix will not cause other
                                        issues, I would like for this
                                        to be made public.


                                        Thank you,


                                        Tyson Burris, President
                                        Internet Communications Inc.
                                        739 Commerce Dr.
                                        Franklin, IN 46131
                                        317-738-0320 Daytime #
                                        317-412-1540 Cell/Direct #
                                        Online: www.surfici.net


