On 6/9/23 00:03, Litterick, Jeff (BIT) via juniper-nsp wrote:

The big issue we ran into is that if you have redundant REs, there is a really
bad bug that will lock the entire chassis up solid somewhere between 6 hours
and 8 days after a reboot (one of our three would lock up quickly after a
reboot; the other two took much longer), to the point where we had to
physically pull the REs to reboot them.  It is fixed now, but they had to
manually poke new firmware into the ASICs on each RE while the REs were in a
half-powered state.  It was a very complex procedure with tech support and the
MX304 engineering team, and it took about 3 hours to do all 3 MX304s, one RE
at a time.  We have not seen an update with this fix built in yet.  (We just
did this back at the end of April.)

Oh dear, that's pretty nasty. So did they say new units shipping today would come with the REs already fixed?

We've been suffering a somewhat similar issue on the PTX1000, where a bug introduced via regression in Junos 21.4, 22.1 and 22.2 causes CPU queues to fill up with unknown-MAC-address frames that are never cleared. It takes 64 days for this packet accumulation to grow to the point where the queues are exhausted, causing a host loopback wedge.

You would see errors like these in the logs:

<date> <time> <hostname> alarmd[27630]: Alarm set: FPC id=150995048, color=RED, class=CHASSIS, reason=FPC 0 Major Errors
<date> <time> <hostname> fpc0 Performing action cmalarm for error /fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[1] (0x20002) in module: Host Loopback with scope: pfe category: functional level: major
<date> <time> <hostname> fpc0 Cmerror Op Set: Host Loopback: HOST LOOPBACK WEDGE DETECTED IN PATH ID 1  (URI: /fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[1])
Apr 1 03:52:28  PTX1000 fpc0 CMError: /fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[3] (0x20004), in module: Host Loopback with scope: pfe category: functional level: major

This causes the router to drop all control plane traffic, which effectively makes it unusable. One has to reboot the box to get it back up and running, until it happens again 64 days later.

The issue is resolved in Junos 21.4R3-S4, 22.4R2, 23.2R1 and 23.3R1.

However, these releases are not shipping yet, so Juniper gave us a workaround SLAX script that automatically runs and clears the CPU queues before the 64 days are up.
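For anyone curious how a workaround like that gets deployed: a SLAX op script is typically wired up as a Junos event script that fires on a timer via event-options. A rough sketch of what that configuration could look like (the script filename, event name and interval here are my own placeholders, not the actual JTAC workaround):

event-options {
    generate-event {
        /* raise a synthetic event once a day (86400 seconds) */
        daily-timer time-interval 86400;
    }
    policy clear-cpu-queues {
        events daily-timer;
        then {
            /* run the workaround script JTAC provided;
               "clear-host-queues.slax" is a placeholder name */
            event-script clear-host-queues.slax;
        }
    }
    event-script {
        file clear-host-queues.slax;
    }
}

The script file itself would live in /var/db/scripts/event/ on the RE; running it well inside the 64-day window gives plenty of margin before the queues fill up.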

We are currently running Junos 22.1R3.9 on this platform, and will move to 22.4R2 in a few weeks to permanently fix this.

Junos 20.2, 20.3 and 20.4 are not affected, nor is anything after 23.2R1.

I understand it may also affect the QFX and MX, but I don't have details on that.

Mark.

_______________________________________________
juniper-nsp mailing list [email protected]
https://puck.nether.net/mailman/listinfo/juniper-nsp
