** Also affects: ubuntu-z-systems
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu)
   Importance: Undecided => High

** Changed in: ubuntu-z-systems
   Importance: Undecided => Medium

** Changed in: ubuntu-z-systems
     Assignee: (unassigned) => Skipper Bug Screeners (skipper-screen-team)

** Changed in: linux (Ubuntu)
     Assignee: Skipper Bug Screeners (skipper-screen-team) => Frank Heimes (fheimes)

** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu)
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1990275

Title:
  [UBUNTU 20.04] Unexpected  LAG affinity behaviour with  mlx5_core
  driver in Ubuntu 20.04

Status in Ubuntu on IBM z Systems:
  New
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Focal:
  New

Bug description:
  == Comment: #0 - KISHORE KUMAR  G <kishor...@in.ibm.com> - 2022-09-19 
04:39:42 ==
  ---Problem Description---
  On an Ubuntu/s390 system that houses a Mellanox CX5 adapter with two
  ports connected to a pair of TOR switches, the system acts as the entry
  point (edge node) through which a cluster of compute nodes accesses the
  public network. The mlx firmware level is as follows:

  ethtool -i p0

  driver: mlx5e_rep
  version: 5.4.0-104.118-
  firmware-version: 16.27.1016 (MT_0000000013)
  expansion-rom-version:
  bus-info: 0100:00:00.0
  supports-statistics: yes
  supports-test: no
  supports-eeprom-access: no
  supports-register-dump: no
  supports-priv-flags: no


  The LAG affinity module of mlx5_core in the upstream 5.4 kernel listens
  to routing events and sets the LAG affinity accordingly, whereas one of
  the custom services, the Fabcon service, listens to the routing events
  and sets the LAG affinity in the Mellanox driver accordingly.

  The edge node routes defined in the compute nodes use both interfaces
  (port 1 = p0 and port 2 = p1) for the LAG affinity. For instance:
  10.66.0.170 proto bgp src 10.66.11.43 metric 20 
  nexthop via 172.31.22.42 dev p0 weight 1 
  nexthop via 172.31.22.170 dev p1 weight 1
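
  To check which interfaces a route spreads traffic over, the route text
  above can be split into its destination and nexthop devices. This is a
  minimal Python sketch, not part of the original report: the function
  name parse_route_nexthops is hypothetical, and it only handles the
  plain-text "ip route" layout shown in this bug.

```python
def parse_route_nexthops(route_text):
    """Return (destination, [device, ...]) for one `ip route` entry.

    Works on the plain-text layout quoted in this report: the first token
    of the first line is the destination, and every `dev <name>` pair on
    any line names an egress interface.
    """
    lines = [ln.strip() for ln in route_text.strip().splitlines() if ln.strip()]
    destination = lines[0].split()[0]
    devices = []
    for ln in lines:
        tokens = ln.split()
        if "dev" in tokens:
            idx = tokens.index("dev")
            if idx + 1 < len(tokens):
                devices.append(tokens[idx + 1])
    return destination, devices


# Route text copied from this report.
MULTIPATH_ROUTE = """
10.66.0.170 proto bgp src 10.66.11.43 metric 20
nexthop via 172.31.22.42 dev p0 weight 1
nexthop via 172.31.22.170 dev p1 weight 1
"""

dst, devs = parse_route_nexthops(MULTIPATH_ROUTE)
print(dst, devs)
```

  A route that lists both p0 and p1 is a multipath route; a route with a
  single device is one that can pull the LAG affinity onto one port.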

  For example, after an edge node boots, the LAG mapping converges to use
  both port 1 (p0) and port 2 (p1) by default:
  root@pok1-qz1-sr1-rk011-s20:/# dmesg | grep lag
  [  282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2
  [  282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2   <------ both ports are equally mapped

  The issue arises when the mlx5_core driver cannot derive the LAG
  configuration from specific routes. For instance, disabling an
  interface on the edge node above (10.66.0.170), or adding or removing
  an interface, causes the mlx5_core driver to react to the routing
  change and switch the LAG affinity to a single network interface only.

  In the following example, a new static route to a single destination
  (10.66.47.34) is added:

  ip route add 10.66.47.34 proto static src 10.66.11.43 metric 20 via 172.31.22.42 dev p0

  This changes the LAG mapping so that both ports map to port 1 (p0), as
  the following output shows:

  root@pok1-qz1-sr1-rk011-s20:/# dmesg | grep lag
  [  282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2
  [  282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2
  [  757.878626] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:1   <---- mapping now goes through P0 only

  This behaviour causes all traffic in 10.x to use a single network
  interface. The TOR switches (fabric) are not aware of such a LAG
  affinity change, and therefore packets are dropped on the "not in use"
  interface (e.g. port 2 (P1)).
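
  The collapse onto a single port can be spotted mechanically from the
  dmesg lines quoted above. The sketch below is illustrative only: it
  assumes that "port N:M" means logical port N is mapped to physical port
  M (consistent with the annotations in this report), and the names
  parse_lag_maps and is_collapsed are hypothetical helpers, not part of
  any mlx5 tooling.

```python
import re

# Matches both "lag map port 1:A port 2:B" and "modify lag map port 1:A port 2:B".
LAG_MAP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+(?P<dev>\S+):\s+"
    r"(?:modify\s+)?lag map port 1:(?P<p1>\d+) port 2:(?P<p2>\d+)"
)


def parse_lag_maps(dmesg_text):
    """Return [(timestamp, {1: phys, 2: phys}), ...] in log order."""
    events = []
    for line in dmesg_text.splitlines():
        m = LAG_MAP_RE.search(line)
        if m:
            port_map = {1: int(m.group("p1")), 2: int(m.group("p2"))}
            events.append((float(m.group("ts")), port_map))
    return events


def is_collapsed(port_map):
    """True when both logical ports point at the same physical port."""
    return len(set(port_map.values())) == 1


# Log lines copied from this report.
SAMPLE = """\
[  282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2
[  282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2
[  757.878626] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:1
"""

events = parse_lag_maps(SAMPLE)
latest_ts, latest_map = events[-1]
if is_collapsed(latest_map):
    print("LAG affinity collapsed onto physical port", latest_map[1])
```

  Only the most recent entry reflects the current mapping, so a check
  like this would look at the last matching line rather than any earlier
  one.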

  So the Mellanox driver (mlx5_core) should not change the LAG
  mapping/configuration based on the last route event; it should rely on
  the default routes only.

  Mellanox agreed to patch this, and the fixes are available in kernels
  5.15.29 and 5.15.39, respectively.
  The following commits resolve this issue:

  1. net/mlx5e: Lag, Only handle events from highest priority multipath entry (upstream kernel 5.15.29)
     https://github.com/torvalds/linux/commit/ad11c4f1d8fd1f03639460e425a36f7fd0ea83f5

  2. net/mlx5e: Lag, Don't skip fib events on current dst (5.15.29)
     https://github.com/torvalds/linux/commit/4a2a664ed87962c4ddb806a84b5c9634820bcf55

  3. net/mlx5e: Lag, Fix fib_info pointer assignment (5.15.39)
     https://github.com/torvalds/linux/commit/a6589155ec9847918e00e7279b8aa6d4c272bea7

  4. net/mlx5e: Lag, Fix use-after-free in fib event handler (5.15.39)
     https://github.com/torvalds/linux/commit/27b0420fd959e38e3500e60b637d39dfab065645

  
  The request is to have the above commits backported to the Ubuntu
  20.04.x series, including the Ubuntu 18.04 HWE kernel.

  
   
  Contact Information = Kishore Kumar G/kishore.pil...@in.ibm.com 
utsav.shrivas...@ibm.com 
   
  ---Additional Hardware Info---
  Mellanox CX5 adapter with firmware-version: 16.27.1016 (MT_0000000013)
   

   
  ---uname output---
  Linux version 5.4.0-104.118
   
  Machine Type = s390x LPAR 
   
  ---Debugger---
  A debugger is not configured
   
  ---Steps to Reproduce---
   ...
  "
  default proto bgp src 10.66.11.41 metric 20
          nexthop via 172.31.22.40 dev p0 weight 1
          nexthop via 172.31.22.168 dev p1 weight 1"
  ......
  172.31.22.40/31 dev p0 proto kernel scope link src 172.31.22.41  
  172.31.22.168/31 dev p1 proto kernel scope link src 172.31.22.169

  ..

  We also have around 64 SR-IOV devices for VM consumption.

  In the above case, the LAG mapping works as expected and uses both
  ports (p0 and p1) for traffic, as shown below:

  root@pok1-qz1-sr1-rk011-s20:/# dmesg | grep lag

  [  282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2

  [  282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2   <<<--- behavior expected

  
  The issue arises when we add a route to a single IP in the underlying
  network with a single next hop: all traffic is then shifted to that
  single next-hop port, as the example below shows.

  
  root@pok1-qz1-sr1-rk011-s20:/# ip route add 10.66.47.34 proto static src 10.66.11.41 metric 20 via 172.31.22.40 dev p0

  
  root@pok1-qz1-sr1-rk011-s20:/# dmesg | grep lag

  [  282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2

  [  282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2

  [  757.878626] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:1   <<<<------- Issue


   
  Stack trace output:
   no
   
  Oops output:
   no
   
  System Dump Info:
    The system is not configured to capture a system dump.
   
  *Additional Instructions for Kishore Kumar G/kishore.pil...@in.ibm.com 
utsav.shrivas...@ibm.com: 
  -Attach sysctl -a output to the bug.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1990275/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
