I’m assuming you’ve tried the obvious “it’s the cable, stupid” rule-outs, such 
as replacing the involved physical components like cables or SFPs. After that, 
the problem is most likely LACP configuration.


As you may know, LACP doesn't use a single "LACP algorithm" for distributing 
packets across links. Instead, you configure one of the available hash-based 
distribution functions the two endpoints have in common. The hash uses packet 
header information to distribute outgoing traffic across the LAG. Common hash 
modes balance traffic on combinations of Layer 2, 3, and 4 header fields, such 
as source and destination MAC addresses, source and destination IP addresses, 
or source and destination TCP/UDP ports. The best choice depends on your 
specific traffic mix and the distribution you want.
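
To make that concrete, here’s a rough Python sketch of the idea. This is purely 
illustrative: real switch ASICs use their own (often proprietary) hash, and the 
member-link names below are made up.

  import hashlib

  # Hypothetical member links of a 2x10G LAG (names invented for illustration)
  MEMBERS = ["xe-0/0/0", "xe-0/0/1"]

  def pick_member(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
      # Hash whichever header fields the configured load-balance mode selects...
      key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
      digest = hashlib.md5(key).digest()  # stand-in for the ASIC's hash function
      # ...and map the result onto exactly one member link.
      return MEMBERS[int.from_bytes(digest[:4], "big") % len(MEMBERS)]

  # Every packet of a given flow hashes to the same member, so there's no
  # reordering; different flows may land on different members, which is where
  # the load balancing comes from.
  print(pick_member("192.0.2.10", "203.0.113.5", 51515, 443))

The corollary is that a single large flow can never use more than one member, 
and a skewed traffic mix can leave one link hot while the other sits idle.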


I’ve found that with LAGs between equipment from different vendors, one or 
more of these algorithms sometimes aren’t compatible, resulting in out-of-order 
or even dropped packets.


For example, Cisco and Juniper have different implementations of LAG hashing 
with similar names. But under the covers, Juniper allows finer-grained control 
over the specific Layer 2, Layer 3, and Layer 4 fields used for hashing through 
the forwarding-options hash-key configuration, while Cisco offers just a few 
fixed load-balance modes (Layer 2, Layer 3, or Layer 4 based), with the 
specific details of the hashing algorithm being proprietary.
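
As a sketch only (the exact hierarchy varies by platform and release, some 
Juniper boxes use forwarding-options enhanced-hash-key instead, and the 
interface names here are placeholders, so verify against your own gear), the 
two ends might look something like this:

Juniper (EX-style):

  set interfaces ae0 aggregated-ether-options lacp active
  set interfaces ae0 aggregated-ether-options lacp periodic fast
  set forwarding-options hash-key family inet layer-3
  set forwarding-options hash-key family inet layer-4

Cisco IOS (the load-balance mode is global to the chassis):

  port-channel load-balance src-dst-ip
  interface TenGigabitEthernet0/1
   channel-group 1 mode active

The point is that the Juniper side lets you name the exact fields, while on the 
Cisco side you pick one of the canned modes and hope the two choices play well 
together.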


In my own experience, packet loss on Cisco-Juniper LACP links has arisen from 
inconsistent or incompatible configurations. You can troubleshoot by checking 
LACP status and interface counters on both sides and making sure settings like 
the LACP rate (fast vs. slow) match what you expect. I’ve even seen duplex 
flapping! Be sure to look at logs on both ends for hardware errors or weird 
messages. If the issue persists, try adjusting LACP parameters and testing with 
a single active member link at a time.
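
A few starting points on each side (again just a sketch; command syntax varies 
a bit by platform, and the interface names are placeholders):

Juniper:

  show lacp interfaces ae0
  show interfaces ae0 extensive       (per-member errors, drops, flaps)
  show log messages | match lacp

Cisco:

  show lacp neighbor
  show etherchannel summary
  show etherchannel load-balance      (confirms which hash mode is in effect)
  show interfaces Port-channel1
  show logging | include LACP

If the rates, modes, or member states don’t line up the way you expect on both 
ends, reconcile that first before chasing anything more exotic.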


Have you tried switching to a different algorithm?


 -mel beckman

On Sep 25, 2025, at 8:22 PM, Andy Cole via NANOG <[email protected]> wrote:

Group,
 I've been peering with both Route Servers in the Dallas IX for over a
month using a single 10G link with no issues. Due to capacity concerns I
had to augment to a 20G LAG. In order to do this, I shut the existing link
down (which dropped both eBGP sessions), used the existing IP space to
create the LAG, and then added the 2nd 10G link. The eBGP sessions
reestablished over the LAG and traffic started flowing error free. No
configuration changes to routing policy at all.  After a few days we
started to get customer complaints for certain sites/domains being
unreachable. I worked around the issue by not announcing the customer
blocks to the route servers and changed the return path to traverse
transit. This solved the issue, but I'm perplexed as to what could've
caused the issue, and where to look to resolve it.  If you guys could
provide feedback and point me in the right direction I'd appreciate it. TIA.

~Andy
_______________________________________________
NANOG mailing list
https://lists.nanog.org/archives/list/[email protected]/message/VQJ37BWPQRQYQB6QMWG6E6SVUDHNYDTO/