Hi All,
I have written a fix for the issue initially documented in 6696163, for which I
have now opened 6817942 to get one bug for one issue (BTW: could someone please
correct my typo in the 6817942 summary: s:>=:>:).
The root cause is documented in 6696163: When choosing a connection for an RPC
call, the old connmgr_get code used lbolt timestamps which don't have enough
precision (nowadays) to yield proper load balancing.
I have implemented round robin load balancing: Whenever a connection is
requested via connmgr_get, it is noted as last used. Upon the next connection
request, the connection which comes next in the list of connections (or the
first one if the last one was used last) is returned.
I have left a comment in the code explaining why I chose the particular approach
and what I think should be done when someone comes to do a larger rewrite of the
code in question.
I would appreciate reviews of my suggested fix, which is available at
http://cr.opensolaris.org/~nigoroll/rpc_loadbalancing_6817942/
Thanks, Nils
--
Tests I conducted:
#### PREPARATIONS
> clnt_max_conns/D
clnt_max_conns:
clnt_max_conns: 4
> rpclog/W b
rpclog: 0 = 0xb
ifconfig e1000g0 addif 192.168.77.100/24 up deprecated
share -o rw=localhost:192.168.77.29,root=localhost:192.168.77.29 \
/var/tmp/s
mkdir /tmp/c
cd ; umount /tmp/c ; \
mount -o vers=3,proto=tcp 192.168.77.100:/var/tmp/s /tmp/c ; \
cd /tmp/c
#### TEST BALANCING
> rpclog/W 2
rpclog: 0 = 0x2
> clnt_max_conns/D
clnt_max_conns:
clnt_max_conns: 4
- mount share (see above)
- ls -als
haggis:/tmp/c# egrep '19:57:.*call going out' /var/adm/messages
Mar 16 19:58:10 haggis rpcmod: [ID 595603 kern.notice] clnt_tli_kinit: COTS
selected
Mar 16 19:58:10 haggis rpcmod: [ID 359612 kern.notice] clnt_cots_kcallit,
procnum 4
Mar 16 19:58:10 haggis rpcmod: [ID 832978 kern.notice] clnt_cots_kcallit:
wait.tv_sec: 60
Mar 16 19:58:10 haggis rpcmod: [ID 359371 kern.notice] clnt_cots_kcallit:
wait.tv_usec: 0
Mar 16 19:57:22 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going
out on ffffff01f4598c38
Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going
out on ffffff01f459b3c8
Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going
out on ffffff01f459b5f0
Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going
out on ffffff01f459b760
Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going
out on ffffff01f4598c38
Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going
out on ffffff01f459b3c8
Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going
out on ffffff01f459b5f0
Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going
out on ffffff01f459b760
Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going
out on ffffff01f4598c38
Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going
out on ffffff01f459b3c8
Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going
out on ffffff01f459b5f0
Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going
out on ffffff01f459b760
Mar 16 19:57:26 haggis rpcmod: [ID 757553 kern.notice] connmgr_get: call going
out on ffffff01f4598c38
[...etc...]
#### Test balancing throughput (can't to a real performance test on a single
#### laptop, this is more a test against massive performance regression)
haggis:/tmp/c# iostat -xznM 5 & dd if=1g_1 of=/dev/zero bs=$((1024*1024))&
[2] 100689
[3] 100690
haggis:/tmp/c# extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
24.1 7.5 1.1 0.1 0.8 0.3 24.3 8.1 12 20 c5t0d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 122.5 0 0
192.168.77.100:/var/tmp/s
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
486.4 10.9 60.6 0.1 2.4 28.1 4.7 56.5 77 100 c5t0d0
1907.1 0.0 59.6 0.0 0.4 1.8 0.2 0.9 18 97
192.168.77.100:/var/tmp/s
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
206.5 1.2 25.8 0.0 2.4 11.8 11.7 56.7 40 40 c5t0d0
4199.2 0.0 131.2 0.0 0.5 2.2 0.1 0.5 33 97
192.168.77.100:/var/tmp/s
1024+0 records in
1024+0 records out
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.4 0.0 0.0 0.0 0.0 0.0 0.0 5.4 0 0 c5t0d0
461.0 0.0 14.4 0.0 0.1 0.2 0.1 0.4 4 8
192.168.77.100:/var/tmp/s
#### Is Dooming working OK?
> rpclog/W 8
rpclog: 0 = 0x8
> clnt_max_conns/W 8
clnt_max_conns: 0x4 = 0x8
* ls -als on share
haggis:/tmp/c# netstat -an| egrep '^192.168.77.100.2049'
192.168.77.100.2049 192.168.77.29.1023 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1022 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1021 49152 0 49308 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1020 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1019 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1018 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1017 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1016 49152 0 49304 0 ESTABLISHED
> clnt_max_conns/W 4
clnt_max_conns: 0x8 = 0x4
haggis:/tmp/c# ls -als
total 7235597
3 drwxr-xr-x 2 root root 10 Mar 16 10:47 .
8 drwxrwxrwt 8 root sys 720 Mar 16 19:57 ..
878969 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_1
794225 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_2
1130659 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_3
865401 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_4
778861 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_5
953995 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_6
969355 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_7
864121 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_8
haggis:/tmp/c# Mar 16 20:01:56 haggis rpcmod: [ID 882835 kern.notice]
connmgr_get: too many conns, dooming entry ffffff01e0ca9240
Mar 16 20:01:56 haggis rpcmod: [ID 882835 kern.notice] connmgr_get: too many
conns, dooming entry ffffff0205a14280
Mar 16 20:01:56 haggis rpcmod: [ID 882835 kern.notice] connmgr_get: too many
conns, dooming entry ffffff01fed6d700
Mar 16 20:01:56 haggis rpcmod: [ID 882835 kern.notice] connmgr_get: too many
conns, dooming entry ffffff01f468d380
... waiting a while ...
Mar 16 20:05:12 haggis rpcmod: [ID 772401 kern.notice] connmgr_sndrel: sending
ordrel to queue ffffff020b4a7e50
Mar 16 20:05:12 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq
0xffffff020b4a7e50: got T_ORDREL_REQ
Mar 16 20:05:12 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq
0xffffff020b6cde40: got T_ORDREL_REQ
netstat -an| egrep '^192.168.77.100.2049'
192.168.77.100.2049 192.168.77.29.1022 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1021 49152 0 49308 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1020 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1019 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1018 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1017 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1016 49152 0 49304 0 ESTABLISHED
haggis:/tmp/c# Mar 16 20:05:46 haggis rpcmod: [ID 772401 kern.notice]
connmgr_sndrel: sending ordrel to queue ffffff02039093b0
Mar 16 20:05:46 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq
0xffffff02039093b0: got T_ORDREL_REQ
Mar 16 20:05:46 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq
0xffffff020b6c88f8: got T_ORDREL_REQ
Mar 16 20:06:02 haggis rpcmod: [ID 772401 kern.notice] connmgr_sndrel: sending
ordrel to queue ffffff0202e440f8
Mar 16 20:06:02 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq
0xffffff0202e440f8: got T_ORDREL_REQ
Mar 16 20:06:02 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq
0xffffff020b6cd3a0: got T_ORDREL_REQ
Mar 16 20:06:03 haggis rpcmod: [ID 772401 kern.notice] connmgr_sndrel: sending
ordrel to queue ffffff02030cdba0
Mar 16 20:06:03 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq
0xffffff02030cdba0: got T_ORDREL_REQ
Mar 16 20:06:03 haggis rpcmod: [ID 776223 kern.notice] mir_wput_other wq
0xffffff02038798f0: got T_ORDREL_REQ
netstat -an| egrep '^192.168.77.100.2049'
192.168.77.100.2049 192.168.77.29.1019 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1018 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1017 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1016 49152 0 49304 0 ESTABLISHED
#### Simple test with one connection only
> clnt_max_conns/W 1
clnt_max_conns: 0x4 = 0x1
haggis:/tmp/c# iostat -xznM 5& dd if=1g_3 of=/dev/null bs=$((1024*1024))
[2] 100717
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
17.5 4.6 1.4 0.1 0.4 0.5 16.8 24.7 6 11 c5t0d0
98.8 0.0 3.1 0.0 0.0 0.0 0.1 0.5 0 3
192.168.77.100:/var/tmp/s
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
533.7 0.0 66.7 0.0 24.8 5.6 46.5 10.5 91 98 c5t0d0
1916.6 0.0 59.7 0.0 0.4 2.3 0.2 1.2 21 96
192.168.77.100:/var/tmp/s
kill %2
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
350.4 0.0 43.8 0.0 19.8 4.5 56.5 12.7 73 76 c5t0d0
2958.3 0.0 90.5 0.0 0.3 2.4 0.1 0.8 21 97
192.168.77.100:/var/tmp/s
1024+0 records in
1024+0 records out
haggis:/tmp/c# kill %2
haggis:/tmp/c# ls -als
total 7235597
3 drwxr-xr-x 2 root root 10 Mar 16 10:47 .
8 drwxrwxrwt 9 root sys 1317 Mar 16 20:38 ..
878969 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_1
794225 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_2
1130659 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_3
865401 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_4
778861 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_5
953995 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_6
969355 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_7
864121 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_8
haggis:/tmp/c# netstat -an| egrep '^192.168.77.100.2049'
192.168.77.100.2049 192.168.77.29.1015 49152 0 49304 0 ESTABLISHED
#### test with >1 destination address
haggis:/tmp/d# mkdir /tmp/e
haggis:/tmp/d# mount -o vers=3,proto=tcp 192.168.77.29:/var/tmp/s /tmp/e
haggis:/tmp/d# netstat -an| egrep '^192.168.77.(29|100).2049'
192.168.77.100.2049 192.168.77.29.1015 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1014 49152 0 49308 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1013 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1012 49152 0 49304 0 ESTABLISHED
192.168.77.29.2049 192.168.77.29.1011 49152 0 49304 0 ESTABLISHED
192.168.77.29.2049 192.168.77.29.1010 49152 0 49304 0 ESTABLISHED
haggis:/tmp/d# cd /tmp/e
haggis:/tmp/e# ls -als
total 7235597
3 drwxr-xr-x 2 root root 10 Mar 16 10:47 .
8 drwxrwxrwt 11 root sys 1433 Mar 16 20:41 ..
878969 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_1
794225 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_2
1130659 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_3
865401 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_4
778861 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_5
953995 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_6
969355 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_7
864121 -rw------- 1 root root 1073741824 Mar 16 10:49 1g_8
haggis:/tmp/e# netstat -an| egrep '^192.168.77.(29|100).2049'
192.168.77.100.2049 192.168.77.29.1015 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1014 49152 0 49308 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1013 49152 0 49304 0 ESTABLISHED
192.168.77.100.2049 192.168.77.29.1012 49152 0 49304 0 ESTABLISHED
192.168.77.29.2049 192.168.77.29.1011 49152 0 49304 0 ESTABLISHED
192.168.77.29.2049 192.168.77.29.1010 49152 0 49304 0 ESTABLISHED
192.168.77.29.2049 192.168.77.29.1009 49152 0 49308 0 ESTABLISHED
192.168.77.29.2049 192.168.77.29.1008 49152 0 49308 0 ESTABLISHED
_______________________________________________
networking-discuss mailing list
[email protected]