Hi All,

I have written a fix for the issue initially documented in 6696163, for which I have now opened 6817942 to get one bug for one issue (BTW: could someone please correct my typo in the 6817942 summary: s:>=:>:).

The root cause is documented in 6696163: When choosing a connection for an RPC call, the old connmgr_get code used lbolt timestamps which don't have enough precision (nowadays) to yield proper load balancing.

I have implemented round robin load balancing: Whenever a connection is requested via connmgr_get, it is noted as last used. Upon the next connection request, the connection which comes next in the list of connections (or the first one if the last one was used last) is returned.

I have left a comment in the code explaining why I chose the particular approach and what I think should be done when someone comes to do a larger rewrite of the code in question.

I would appreciate reviews of my suggested fix, which is available at

http://cr.opensolaris.org/~nigoroll/rpc_loadbalancing_6817942/

Thanks, Nils

--

Tests I conducted:

#### PREPARATIONS

> clnt_max_conns/D
clnt_max_conns:
clnt_max_conns: 4

> rpclog/W b
rpclog:         0               =       0xb

ifconfig e1000g0 addif 192.168.77.100/24 up deprecated
share -o rw=localhost:192.168.77.29,root=localhost:192.168.77.29 \
/var/tmp/s
mkdir /tmp/c
cd ; umount /tmp/c ; \
mount -o vers=3,proto=tcp 192.168.77.100:/var/tmp/s /tmp/c ; \
cd /tmp/c

#### TEST BALANCING

> rpclog/W 2
rpclog:         0               =       0x2
> clnt_max_conns/D
clnt_max_conns:
clnt_max_conns: 4

- mount share (see above)
- ls -als

haggis:/tmp/c# egrep '19:57:.*call going out' /var/adm/messages
Mar 16 19:58:10 haggis rpcmod: [ID 595603 kern.notice] clnt_tli_kinit: COTS 
selected
Mar 16 19:58:10 haggis rpcmod: [ID 359612 kern.notice] clnt_cots_kcallit, 
procnum 4
Mar 16 19:58:10 haggis rpcmod: [ID 832978 kern.notice] clnt_cots_kcallit: wait.tv_sec: 60 Mar 16 19:58:10 haggis rpcmod: [ID 359371 kern.notice] clnt_cots_kcallit: wait.tv_usec: 0 Mar 16 19:57:22 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going out on ffffff01f4598c38 Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going out on ffffff01f459b3c8 Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going out on ffffff01f459b5f0 Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going out on ffffff01f459b760 Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going out on ffffff01f4598c38 Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going out on ffffff01f459b3c8 Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going out on ffffff01f459b5f0 Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going out on ffffff01f459b760 Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going out on ffffff01f4598c38 Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going out on ffffff01f459b3c8 Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going out on ffffff01f459b5f0 Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going out on ffffff01f459b760 Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going out on ffffff01f4598c38
[...etc...]

#### Test balancing throughput (can't to a real performance test on a single
#### laptop, this is more a test against massive performance regression)

haggis:/tmp/c# iostat -xznM 5 & dd if=1g_1 of=/dev/zero bs=$((1024*1024))&
[2] 100689
[3] 100690
haggis:/tmp/c#                     extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   24.1    7.5    1.1    0.1  0.8  0.3   24.3    8.1  12  20 c5t0d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 122.5 0 0 192.168.77.100:/var/tmp/s
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  486.4   10.9   60.6    0.1  2.4 28.1    4.7   56.5  77 100 c5t0d0
1907.1 0.0 59.6 0.0 0.4 1.8 0.2 0.9 18 97 192.168.77.100:/var/tmp/s
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  206.5    1.2   25.8    0.0  2.4 11.8   11.7   56.7  40  40 c5t0d0
4199.2 0.0 131.2 0.0 0.5 2.2 0.1 0.5 33 97 192.168.77.100:/var/tmp/s
1024+0 records in
1024+0 records out
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.4    0.0    0.0    0.0  0.0  0.0    0.0    5.4   0   0 c5t0d0
461.0 0.0 14.4 0.0 0.1 0.2 0.1 0.4 4 8 192.168.77.100:/var/tmp/s

#### Is Dooming working OK?

> rpclog/W 8
rpclog:         0               =       0x8
> clnt_max_conns/W 8
clnt_max_conns: 0x4             =       0x8

* ls -als on share

haggis:/tmp/c# netstat -an| egrep '^192.168.77.100.2049'
192.168.77.100.2049  192.168.77.29.1023   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1022   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1021   49152      0 49308      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1020   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1019   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1018   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1017   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1016   49152      0 49304      0 ESTABLISHED

> clnt_max_conns/W 4
clnt_max_conns: 0x8             =       0x4

haggis:/tmp/c# ls -als
total 7235597
   3 drwxr-xr-x   2 root     root          10 Mar 16 10:47 .
   8 drwxrwxrwt   8 root     sys          720 Mar 16 19:57 ..
878969 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_1
794225 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_2
1130659 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_3
865401 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_4
778861 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_5
953995 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_6
969355 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_7
864121 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_8
haggis:/tmp/c# Mar 16 20:01:56 haggis rpcmod: [ID 882835 kern.notice] connmgr_get: too many conns, dooming entry ffffff01e0ca9240 Mar 16 20:01:56 haggis rpcmod: [ID 882835 kern.notice] connmgr_get: too many conns, dooming entry ffffff0205a14280 Mar 16 20:01:56 haggis rpcmod: [ID 882835 kern.notice] connmgr_get: too many conns, dooming entry ffffff01fed6d700 Mar 16 20:01:56 haggis rpcmod: [ID 882835 kern.notice] connmgr_get: too many conns, dooming entry ffffff01f468d380

... waiting a while ...

Mar 16 20:05:12 haggis rpcmod: [ID 772401 kern.notice] connmgr_sndrel: sending ordrel to queue ffffff020b4a7e50 Mar 16 20:05:12 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq 0xffffff020b4a7e50: got T_ORDREL_REQ Mar 16 20:05:12 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq 0xffffff020b6cde40: got T_ORDREL_REQ
netstat -an| egrep '^192.168.77.100.2049'
192.168.77.100.2049  192.168.77.29.1022   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1021   49152      0 49308      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1020   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1019   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1018   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1017   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1016   49152      0 49304      0 ESTABLISHED
haggis:/tmp/c# Mar 16 20:05:46 haggis rpcmod: [ID 772401 kern.notice] connmgr_sndrel: sending ordrel to queue ffffff02039093b0 Mar 16 20:05:46 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq 0xffffff02039093b0: got T_ORDREL_REQ Mar 16 20:05:46 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq 0xffffff020b6c88f8: got T_ORDREL_REQ Mar 16 20:06:02 haggis rpcmod: [ID 772401 kern.notice] connmgr_sndrel: sending ordrel to queue ffffff0202e440f8 Mar 16 20:06:02 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq 0xffffff0202e440f8: got T_ORDREL_REQ Mar 16 20:06:02 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq 0xffffff020b6cd3a0: got T_ORDREL_REQ Mar 16 20:06:03 haggis rpcmod: [ID 772401 kern.notice] connmgr_sndrel: sending ordrel to queue ffffff02030cdba0 Mar 16 20:06:03 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq 0xffffff02030cdba0: got T_ORDREL_REQ Mar 16 20:06:03 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq 0xffffff02038798f0: got T_ORDREL_REQ
netstat -an| egrep '^192.168.77.100.2049'
192.168.77.100.2049  192.168.77.29.1019   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1018   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1017   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1016   49152      0 49304      0 ESTABLISHED


#### Simple test with one connection only

> clnt_max_conns/W 1
clnt_max_conns: 0x4             =       0x1

haggis:/tmp/c# iostat -xznM 5& dd if=1g_3 of=/dev/null bs=$((1024*1024))
[2] 100717
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   17.5    4.6    1.4    0.1  0.4  0.5   16.8   24.7   6  11 c5t0d0
98.8 0.0 3.1 0.0 0.0 0.0 0.1 0.5 0 3 192.168.77.100:/var/tmp/s
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  533.7    0.0   66.7    0.0 24.8  5.6   46.5   10.5  91  98 c5t0d0
1916.6 0.0 59.7 0.0 0.4 2.3 0.2 1.2 21 96 192.168.77.100:/var/tmp/s
kill %2
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  350.4    0.0   43.8    0.0 19.8  4.5   56.5   12.7  73  76 c5t0d0
2958.3 0.0 90.5 0.0 0.3 2.4 0.1 0.8 21 97 192.168.77.100:/var/tmp/s
1024+0 records in
1024+0 records out
haggis:/tmp/c# kill %2

haggis:/tmp/c# ls -als
total 7235597
   3 drwxr-xr-x   2 root     root          10 Mar 16 10:47 .
   8 drwxrwxrwt   9 root     sys         1317 Mar 16 20:38 ..
878969 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_1
794225 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_2
1130659 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_3
865401 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_4
778861 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_5
953995 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_6
969355 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_7
864121 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_8
haggis:/tmp/c# netstat -an| egrep '^192.168.77.100.2049'
192.168.77.100.2049  192.168.77.29.1015   49152      0 49304      0 ESTABLISHED

#### test with >1 destination address

haggis:/tmp/d# mkdir /tmp/e
haggis:/tmp/d# mount -o vers=3,proto=tcp 192.168.77.29:/var/tmp/s /tmp/e
haggis:/tmp/d# netstat -an| egrep '^192.168.77.(29|100).2049'
192.168.77.100.2049  192.168.77.29.1015   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1014   49152      0 49308      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1013   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1012   49152      0 49304      0 ESTABLISHED
192.168.77.29.2049   192.168.77.29.1011   49152      0 49304      0 ESTABLISHED
192.168.77.29.2049   192.168.77.29.1010   49152      0 49304      0 ESTABLISHED
haggis:/tmp/d# cd /tmp/e
haggis:/tmp/e# ls -als
total 7235597
   3 drwxr-xr-x   2 root     root          10 Mar 16 10:47 .
   8 drwxrwxrwt  11 root     sys         1433 Mar 16 20:41 ..
878969 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_1
794225 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_2
1130659 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_3
865401 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_4
778861 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_5
953995 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_6
969355 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_7
864121 -rw-------   1 root     root     1073741824 Mar 16 10:49 1g_8
haggis:/tmp/e# netstat -an| egrep '^192.168.77.(29|100).2049'
192.168.77.100.2049  192.168.77.29.1015   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1014   49152      0 49308      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1013   49152      0 49304      0 ESTABLISHED
192.168.77.100.2049  192.168.77.29.1012   49152      0 49304      0 ESTABLISHED
192.168.77.29.2049   192.168.77.29.1011   49152      0 49304      0 ESTABLISHED
192.168.77.29.2049   192.168.77.29.1010   49152      0 49304      0 ESTABLISHED
192.168.77.29.2049   192.168.77.29.1009   49152      0 49308      0 ESTABLISHED
192.168.77.29.2049   192.168.77.29.1008   49152      0 49308      0 ESTABLISHED
_______________________________________________
networking-discuss mailing list
[email protected]

Reply via email to