Vincent Ficet wrote:
Yevgeny,

Hi Vincent,

Vincent Ficet wrote:
Hello,

Following the QoS experiments I carried out yesterday, I wanted to set
up 3 IP networks, each one bound to a particular pkey, in order to
achieve QoS for each network.
Unfortunately, it seems that something is not mapped properly in the ULP
layers (vlarb tables are fine).

The settings are as follows:

opensm.conf:
------------

qos_max_vls    8
qos_high_limit 1
qos_vlarb_high 0:0,1:0,2:0,3:0,4:0,5:0
qos_vlarb_low  0:8,1:1,2:1,3:4,4:0,5:0
qos_sl2vl      0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
Please check section 7 of the QoS_management_in_OpenSM.txt
doc. It explains exactly what the values in the VLArb table
mean. It also has an explanation of the problem that you're
seeing. Quoting from there:

"Keep in mind that ports usually transmit packets of
 size equal to MTU. For instance, for 4KB MTU a single
 packet will require 64 credits, so in order to achieve
 effective VL arbitration for packets of 4KB MTU, the
 weighting values for each VL should be multiples of 64."

OK, I see the point.

To check that it works as you said, we changed the IPoIB MTU from 2044
to 2000 in order to make sure that it fits into the IB MTU, which is set
to 2K on our cluster.
In theory, such a 2K packet would require 32 credits of 64 bytes each.
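
The credit arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes one VLArb weight unit corresponds to 64 bytes, as the quoted OpenSM document states, and it ignores packet headers. The name `credits_per_packet` is made up for illustration.

```python
import math

# Assumption: one VL arbitration weight unit (credit) covers 64 bytes,
# as stated in the quoted QoS_management_in_OpenSM.txt passage.
CREDIT_BYTES = 64


def credits_per_packet(mtu_bytes: int) -> int:
    """Return the number of 64-byte credits one MTU-sized packet needs."""
    return math.ceil(mtu_bytes / CREDIT_BYTES)


for mtu in (2048, 4096):
    print(f"IB MTU {mtu}: {credits_per_packet(mtu)} credits per packet")
```

Under that assumption, weights for a 2K IB MTU would be chosen as multiples of 32, and for a 4K IB MTU as multiples of 64.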

We changed the vlarb tables with increments of 32 (for VL 1,2,3):

qos_max_vls    8
qos_high_limit 1
qos_vlarb_high 0:0,1:0,2:0,3:0,4:0,5:0
qos_vlarb_low  0:8,1:32,2:64,3:96,4:0,5:0
qos_sl2vl      0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

and we also tried increments of 64:

qos_max_vls    8
qos_high_limit 1
qos_vlarb_high 0:0,1:0,2:0,3:0,4:0,5:0
qos_vlarb_low  0:8,1:64,2:128,3:192,4:0,5:0
qos_sl2vl      0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

But still, it does not make any difference:

[r...@pichu22 ~]# while test -e keep_going; do iperf -c pichu16-ic0 -t 20 2>&1; done | grep Gbits/sec
[  3]  0.0-20.0 sec  13.0 GBytes  5.57 Gbits/sec
[  3]  0.0-20.0 sec  12.9 GBytes  5.53 Gbits/sec
[  3]  0.0-20.0 sec  12.0 GBytes  5.17 Gbits/sec

[r...@pichu22 ~]# while test -e keep_going; do iperf -c pichu16-backbone -t 20 2>&1; done | grep Gbits/sec
[  3]  0.0-20.0 sec  13.1 GBytes  5.61 Gbits/sec
[  3]  0.0-20.0 sec  11.9 GBytes  5.09 Gbits/sec
[  3]  0.0-20.0 sec  9.43 GBytes  4.05 Gbits/sec

[r...@pichu22 ~]# while test -e keep_going; do iperf -c pichu16-admin -t 20 2>&1; done | grep Gbits/sec
[  3]  0.0-20.0 sec  10.5 GBytes  4.50 Gbits/sec
[  3]  0.0-20.0 sec  12.3 GBytes  5.28 Gbits/sec
[  3]  0.0-20.0 sec  12.0 GBytes  5.15 Gbits/sec

Any other ideas?

OK, so there are three possible reasons that I can think of:
1. Something is wrong in the configuration.
2. The application does not saturate the link, thus QoS
  and the whole VL arbitration thing doesn't kick in.
3. There's some bug, somewhere.

Let's start with reason no. 1.
Please shut off each of the SLs one by one, and
make sure that the application gets zero BW on
these SLs. You can do it by mapping SL to VL15:

qos_sl2vl      0,15,2,3,4,5,6,7,8,9,10,11,12,13,14,15

and then
qos_sl2vl      0,1,15,3,4,5,6,7,8,9,10,11,12,13,14,15

and then
qos_sl2vl      0,1,2,15,4,5,6,7,8,9,10,11,12,13,14,15

If this part works well, then we will continue to
reason no. 2.
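
The three test configurations above differ only in which SL is redirected to VL15. A small Python sketch can generate them, which avoids typos when editing opensm.conf by hand. It assumes the identity SL-to-VL mapping from the posted configuration as the starting point; the helper name `sl2vl_with_sl_disabled` is hypothetical.

```python
def sl2vl_with_sl_disabled(sl: int, num_sls: int = 16) -> str:
    """Build a qos_sl2vl line where one SL is mapped to VL15.

    Starts from the identity mapping (SL n -> VL n) used in the posted
    opensm.conf and redirects only the requested SL.
    """
    mapping = list(range(num_sls))
    mapping[sl] = 15
    return "qos_sl2vl".ljust(15) + ",".join(str(vl) for vl in mapping)


# SLs 1, 2 and 3 are the ones assigned to the three IPoIB networks.
for sl in (1, 2, 3):
    print(sl2vl_with_sl_disabled(sl))
```

Each printed line would replace the qos_sl2vl entry in opensm.conf for one test run.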

-- Yevgeny


Thanks for your help.

Vincent

-- Yevgeny


The corresponding VLArb tables are fine on both the server (pichu16) and
the client (pichu22):

[r...@pichu22 network-scripts]# smpquery vlarb -D 0
# VLArbitration tables: DR path slid 65535; dlid 65535; 0 port 0 LowCap 8 HighCap 8
# Low priority VL Arbitration Table:
VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x0 |0x0 |
WEIGHT: |0x8 |0x1 |0x1 |0x4 |0x0 |0x0 |0x0 |0x0 |
# High priority VL Arbitration Table:
VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x0 |0x0 |
WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |

[r...@pichu16 ~]# smpquery vlarb -D 0
# VLArbitration tables: DR path slid 65535; dlid 65535; 0 port 0 LowCap 8 HighCap 8
# Low priority VL Arbitration Table:
VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x0 |0x0 |
WEIGHT: |0x8 |0x1 |0x1 |0x4 |0x0 |0x0 |0x0 |0x0 |
# High priority VL Arbitration Table:
VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x0 |0x0 |
WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
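
To compare the tables on several nodes without reading hex by eye, the VL and WEIGHT rows can be turned into (VL, weight) pairs. The following Python sketch assumes the row layout shown in the smpquery output above; `parse_vlarb_rows` is a made-up helper, not part of any InfiniBand tool.

```python
def parse_vlarb_rows(vl_line: str, weight_line: str) -> list[tuple[int, int]]:
    """Pair up the VL and WEIGHT rows of one smpquery vlarb table."""

    def cells(line: str) -> list[int]:
        # Drop the "VL    :" / "WEIGHT:" label, then read the hex cells.
        _, _, rest = line.partition(":")
        return [int(cell.strip(), 16) for cell in rest.split("|") if cell.strip()]

    return list(zip(cells(vl_line), cells(weight_line)))


# Low priority table as printed on both pichu16 and pichu22.
low_vl = "VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x0 |0x0 |"
low_weight = "WEIGHT: |0x8 |0x1 |0x1 |0x4 |0x0 |0x0 |0x0 |0x0 |"

for vl, weight in parse_vlarb_rows(low_vl, low_weight):
    print(f"VL{vl}: weight {weight}")
```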

partitions.conf:
---------------

default=0x7fff,ipoib            : ALL=full;
ip_backbone=0x0001,ipoib        : ALL=full;
ip_admin=0x0002,ipoib            : ALL=full;

qos-policy.conf:
---------------

qos-ulps
    default                : 0 # default SL
    ipoib, pkey 0x7FFF     : 1 # IP with default pkey 0x7FFF
    ipoib, pkey 0x1        : 2 # backbone IP with pkey 0x1
    ipoib, pkey 0x2        : 3 # admin IP with pkey 0x2
end-qos-ulps
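
The intended chain from pkey to arbitration weight, as configured above, can be written out in Python. This is a sketch of the intended mapping only, not of what the fabric actually does. It assumes the top bit of a pkey is the membership bit, so the child interface ib0.8001 corresponds to pkey 0x0001 in partitions.conf. All names in the code are illustrative.

```python
# ipoib pkey -> SL, from qos-policy.conf
QOS_ULPS = {0x7FFF: 1, 0x0001: 2, 0x0002: 3}
DEFAULT_SL = 0

# SL -> VL, from qos_sl2vl (identity mapping)
SL2VL = list(range(16))

# VL -> weight, from qos_vlarb_low in the original opensm.conf
VLARB_LOW = {0: 8, 1: 1, 2: 1, 3: 4, 4: 0, 5: 0}


def trace(pkey: int) -> tuple[int, int, int]:
    """Return (SL, VL, low-priority weight) intended for an IPoIB pkey."""
    base_pkey = pkey & 0x7FFF  # strip the membership bit (0x8000)
    sl = QOS_ULPS.get(base_pkey, DEFAULT_SL)
    vl = SL2VL[sl]
    return sl, vl, VLARB_LOW.get(vl, 0)


for pkey in (0x7FFF, 0x8001, 0x8002):
    sl, vl, weight = trace(pkey)
    print(f"pkey {pkey:#06x}: SL {sl} -> VL {vl} -> weight {weight}")
```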

Assigned IP addresses (in /etc/hosts):
-------------------------------------

10.12.1.4       pichu16-ic0             # default IPoIB network, pkey 0x7FFF
10.13.1.4       pichu16-backbone        # IPoIB backbone network, pkey 0x1
10.14.1.4       pichu16-admin           # IPoIB admin network, pkey 0x2
10.12.1.10      pichu22-ic0             # default IPoIB network, pkey 0x7FFF
10.13.1.10      pichu22-backbone        # IPoIB backbone network, pkey 0x1
10.14.1.10      pichu22-admin           # IPoIB admin network, pkey 0x2

Note that the netmask is /16, so the -ic0, -backbone and -admin networks
cannot see each other.
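
The claim that the three networks are disjoint with a /16 netmask can be verified with Python's standard `ipaddress` module. The addresses are the ones listed in /etc/hosts above; the dictionary layout and the helper name `subnet_of` are just for illustration.

```python
import ipaddress
from itertools import combinations

# Addresses from /etc/hosts, grouped by IPoIB network.
HOSTS = {
    "ic0": ["10.12.1.4", "10.12.1.10"],
    "backbone": ["10.13.1.4", "10.13.1.10"],
    "admin": ["10.14.1.4", "10.14.1.10"],
}
PREFIX = 16  # NETMASK=255.255.0.0


def subnet_of(addr: str) -> ipaddress.IPv4Network:
    """Return the /16 network an address belongs to."""
    return ipaddress.ip_interface(f"{addr}/{PREFIX}").network


subnets = {}
for name, addrs in HOSTS.items():
    nets = {subnet_of(a) for a in addrs}
    # Server and client must share one subnet per network.
    assert len(nets) == 1, f"{name}: hosts are on different subnets"
    subnets[name] = nets.pop()

overlapping = [
    (a, b) for a, b in combinations(subnets, 2) if subnets[a].overlaps(subnets[b])
]
print("subnets:", {name: str(net) for name, net in subnets.items()})
print("overlapping pairs:", overlapping)
```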

IPoIB settings on server side:
------------------------------

[r...@pichu16 ~]# tail -n 5 /etc/sysconfig/network-scripts/ifcfg-ib0*
==> /etc/sysconfig/network-scripts/ifcfg-ib0 <==
BOOTPROTO=static
IPADDR=10.12.1.4
NETMASK=255.255.0.0
ONBOOT=yes
MTU=2044

==> /etc/sysconfig/network-scripts/ifcfg-ib0.8001 <==
BOOTPROTO=static
IPADDR=10.13.1.4
NETMASK=255.255.0.0
ONBOOT=yes
MTU=2044

==> /etc/sysconfig/network-scripts/ifcfg-ib0.8002 <==
BOOTPROTO=static
IPADDR=10.14.1.4
NETMASK=255.255.0.0
ONBOOT=yes
MTU=2044

[r...@pichu16 ~]# ip addr show ib0
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:2c:90:00:10:0d:00:05:6d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.12.1.4/16 brd 10.12.255.255 scope global ib0
    inet 10.13.1.4/16 brd 10.13.255.255 scope global ib0
    inet 10.14.1.4/16 brd 10.14.255.255 scope global ib0
    inet6 fe80::2e90:10:d00:56d/64 scope link
       valid_lft forever preferred_lft forever

IPoIB settings on client side:
------------------------------

[r...@pichu22 ~]# tail -n 5 /etc/sysconfig/network-scripts/ifcfg-ib0*
==> /etc/sysconfig/network-scripts/ifcfg-ib0 <==
BOOTPROTO=static
IPADDR=10.12.1.10
NETMASK=255.255.0.0
ONBOOT=yes
MTU=2044

==> /etc/sysconfig/network-scripts/ifcfg-ib0.8001 <==
BOOTPROTO=static
IPADDR=10.13.1.10
NETMASK=255.255.0.0
ONBOOT=yes
MTU=2044

==> /etc/sysconfig/network-scripts/ifcfg-ib0.8002 <==
BOOTPROTO=static
IPADDR=10.14.1.10
NETMASK=255.255.0.0
ONBOOT=yes
MTU=2044

[r...@pichu22 ~]# ip addr show ib0
48: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:2c:90:00:10:0d:00:06:79 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.12.1.10/16 brd 10.12.255.255 scope global ib0
    inet 10.13.1.10/16 brd 10.13.255.255 scope global ib0
    inet 10.14.1.10/16 brd 10.14.255.255 scope global ib0
    inet6 fe80::2e90:10:d00:679/64 scope link
       valid_lft forever preferred_lft forever

Iperf servers on server side:
-----------------------------

Quoting from iperf help:
  -B, --bind      <host>   bind to <host>, an interface or multicast address
  -s, --server             run in server mode

Each iperf server is bound to a dedicated interface as follows:

[r...@pichu16 ~]# iperf -s -B pichu16-backbone
[r...@pichu16 ~]# iperf -s -B pichu16-admin
[r...@pichu16 ~]# iperf -s -B pichu16-ic0

Iperf clients on client side:
-----------------------------

Quoting from iperf help:
  -c, --client    <host>   run in client mode, connecting to <host>
  -t, --time      #        time in seconds to transmit for (default 10 secs)

And each iperf client talks to the corresponding iperf server:

[r...@pichu22 ~]# while test -e keep_going; do iperf -c pichu16-ic0 -t 100 2>&1; done | grep Gbits/sec
[  3]  0.0-100.0 sec  64.6 GBytes  5.55 Gbits/sec
[  3]  0.0-100.0 sec  64.5 GBytes  5.54 Gbits/sec
[  3]  0.0-100.0 sec  60.5 GBytes  5.20 Gbits/sec
[r...@pichu22 ~]# while test -e keep_going; do iperf -c pichu16-backbone -t 100 2>&1; done | grep Gbits/sec
[  3]  0.0-100.0 sec  64.8 GBytes  5.57 Gbits/sec
[  3]  0.0-100.0 sec  56.7 GBytes  4.87 Gbits/sec
[  3]  0.0-100.0 sec  59.7 GBytes  5.13 Gbits/sec
[r...@pichu22 ~]# while test -e keep_going; do iperf -c pichu16-admin -t 100 2>&1; done | grep Gbits/sec
[  3]  0.0-100.0 sec  57.3 GBytes  4.92 Gbits/sec
[  3]  0.0-100.0 sec  61.6 GBytes  5.29 Gbits/sec
[  3]  0.0-100.0 sec  62.7 GBytes  5.38 Gbits/sec

Given the VLarb weights assigned (1 for *-ic0 on VL1, 1 for *-backbone
on VL2 and 4 for *-admin on VL3), we would expect different b/w figures
for the *-admin network.
As we can see, all iperf values are roughly the same, showing that QoS is
not enforced on a per-pkey basis.
It seems to me that something is not mapped properly in the ULP layers.
Could anyone tell me if I'm wrong here? If not, is this a known issue?
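
To put numbers on that expectation, the following Python sketch compares the share of bandwidth each network would get in an idealized case with the share actually measured. It assumes the three flows compete on one saturated link and that bandwidth is split strictly in proportion to the low-priority weights, which is a simplification. The measured figures are the 100-second iperf results above.

```python
# Low-priority VLArb weights from the original opensm.conf, by network.
WEIGHTS = {"ic0": 1, "backbone": 1, "admin": 4}

# iperf results in Gbits/sec from the 100-second runs.
MEASURED = {
    "ic0": [5.55, 5.54, 5.20],
    "backbone": [5.57, 4.87, 5.13],
    "admin": [4.92, 5.29, 5.38],
}


def expected_shares(weights: dict[str, int]) -> dict[str, float]:
    """Idealized share per network: weight divided by the sum of weights."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def measured_shares(results: dict[str, list[float]]) -> dict[str, float]:
    """Share per network based on the mean of its iperf runs."""
    means = {name: sum(runs) / len(runs) for name, runs in results.items()}
    total = sum(means.values())
    return {name: m / total for name, m in means.items()}


expected = expected_shares(WEIGHTS)
measured = measured_shares(MEASURED)
for name in WEIGHTS:
    print(f"{name:9s} expected {expected[name]:.2f}  measured {measured[name]:.2f}")
```

If arbitration were effective under these assumptions, the admin network would take about two thirds of the bandwidth instead of about one third.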

Thanks for your help,

Vincent





--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html




