Deeper searching brought us to a Solaris 10 source from 2005 (?) in which
the network stack matched the stack trace from our dtrace script.
With the information found there it was possible to examine further the
reason for dropping the packets: it turns out that the extended buffer
length set by our ndd command is not used for this particular socket
connection.
This implies that the UDP buffer size for the socket is set explicitly,
probably by openAFS itself.
Searching the AFS source for a setsockopt() call with SO_RCVBUF
yields the following C code:
    len = rx_UdpBufSize;
    error = sockfs_sosetsockopt(so, SOL_SOCKET, SO_RCVBUF, &len,
                                sizeof(len));
Here rx_UdpBufSize turns out to be a constant defined in a global
.h file with the value 64*1024 (64 KB).
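As an illustration of why an explicit SO_RCVBUF request takes precedence over the system-wide default, here is a minimal user-space sketch. It is not the openAFS kernel code (which calls sockfs_sosetsockopt() as shown above); the function name effective_rcvbuf is hypothetical, and only the standard socket(), setsockopt() and getsockopt() calls are used. It requests a receive buffer size on a fresh UDP socket and reads back what the kernel actually applied. The kernel may round, cap, or reject the request, so the returned value need not equal the requested one.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a UDP socket, request a receive buffer of
 * `requested` bytes via SO_RCVBUF, and return the size the kernel reports
 * back through getsockopt(). Returns -1 on any error.
 */
int effective_rcvbuf(int requested)
{
    assert(requested > 0);

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return -1;
    }

    /* Same option the AFS snippet sets, but from user space. */
    int len = requested;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &len, sizeof(len)) != 0) {
        perror("setsockopt(SO_RCVBUF)");
        close(fd);
        return -1;
    }

    /* Read back the value that is actually in effect for this socket. */
    int actual = 0;
    socklen_t optlen = sizeof(actual);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &optlen) != 0) {
        perror("getsockopt(SO_RCVBUF)");
        close(fd);
        return -1;
    }

    close(fd);
    return actual;
}
```

Calling effective_rcvbuf(64 * 1024) and printing the result shows the per-socket value that applies once an application has set the option itself, independent of the tunable default.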
We are presently working on modifying this constant setting and testing
if this helps.
On 19.04.17 19:26, Karl Behler wrote:
Dear All,
Today I posted the following question in the Oracle Solaris Community network.
However, I would also like to hear your opinions or hints about it.
Under high load (30+ users logged in to one RayServer) we
occasionally experience hangs of the users' desktops lasting from
several seconds up to minutes. This is not a continuously growing
degradation but more like hitting a wall (nothing works), followed by
a recovery as if nothing had happened. The frequency of these
events goes up over the week, and, typically in the afternoon,
systems occasionally come to a complete and sometimes
unrecoverable halt (needing a hard reboot).
We are using openAFS as the network filesystem for home directories and
data files on our SunRay servers. As the number of user sessions
(20-40) and the activity (5-10% CPU usage) rise on these SunRay
servers, we observe alarming values from the command "netstat -s".
IPv4    ipForwarding      =         2   ipDefaultTTL      =       255
        ipInReceives      = 123501543   ipInHdrErrors     =         0
        ipInAddrErrors    =         0   ipInCksumErrs     =         0
        ipForwDatagrams   =         0   ipForwProhibits   =        19
        ipInUnknownProtos =      3361   ipInDiscards      =         6
        ipInDelivers      = 123975114   ipOutRequests     = 114328336
        ...
        tcpInErrs         =         0   udpNoPorts        =      3248
        udpInCksumErrs    =         0   udpInOverflows    =      1848
        ...
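To watch how a counter such as udpInOverflows develops over time, one can extract it from saved "netstat -s" output. The following is a minimal sketch, assuming the "name = value" layout shown above; the function name mib_counter is hypothetical and not part of any Solaris tool.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: find "name = value" in netstat -s style text and
 * return the value, or -1 if the counter is not present. Whitespace around
 * '=' is optional, since netstat prints large values without a gap.
 */
long mib_counter(const char *text, const char *name)
{
    size_t n = strlen(name);
    if (n == 0)
        return -1;

    for (const char *p = strstr(text, name); p != NULL;
         p = strstr(p + 1, name)) {
        /* Reject matches that are only the tail of a longer name. */
        if (p != text && isalnum((unsigned char)p[-1]))
            continue;

        /* The name must be followed by optional blanks and '='. */
        const char *q = p + n;
        while (*q == ' ' || *q == '\t')
            q++;
        if (*q != '=')
            continue;
        q++;

        /* strtol() skips leading blanks before the digits. */
        char *end;
        long value = strtol(q, &end, 10);
        if (end == q)
            continue;
        return value;
    }
    return -1;
}
```

Sampling the counter at two points in time and taking the difference gives the number of overflows in that interval, which is easier to correlate with the observed hangs than the absolute total.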
As the protocol used between the openAFS file servers and the openAFS
clients on the SunRay servers is UDP based, we used rxdebug for
further insight. We are seeing resends and waiting calls.
Using dtrace to analyze what kind of packets are dropped, one sees
clearly that the only dropped packets are those sent from the openAFS
fileservers x.x.[100.59,100.67,30.41] to the SunRay server client on
x.x.100.129:
13 5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
13 5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
13 5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
17 5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.67:7000 dst: x.x.100.129:7001 count: 1
13 5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
13 5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
35 5093 ip_udp_input:udpIfStatsInOverflows src: x.x. 30.41:7000 dst: x.x.100.129:7001 count: 1
According to the knowledge base document "netstat -s : Information and
notes (Doc ID 1010792.1
<https://support.oracle.com/rs?type=doc&id=1010792.1>)", increasing
the udp_recv_hiwat parameter is recommended.
However, using "ndd -set /dev/udp udp_recv_hiwat 134217728" (128 MB)
does not reduce the udpInOverflows count.
Further research using dtrace provides evidence that the UDP packets are
discarded early in the IP stack.
    dtrace -n 'mib:::udpIfStatsInOverflows{stack();}'
    12 4648 ip_udp_input:udpIfStatsInOverflows
        ip`ip_input+0xcb2
        dls`soft_ring_drain+0x93
        dls`soft_ring_worker+0xdb
        unix`thread_start+0x8
However, this is where our debugging possibilities and our insight
into what is going on end.
This all happens on an Oracle Blade X4-2B (blade center 6000) under
SunOS 5.10 Generic_150401-48 with SunRay server software
4.5.4_34,REV=2015.04.14.10.39. Two 10Gb network interfaces are used.
Any hint on how one could proceed in analyzing the situation (e.g.
understanding the stack) or which further system tuning parameters to
try would be very much appreciated.
I'm including the netstat document from the Oracle knowledge base
since it cannot be found by a web search.
Best regards,
Karl
--
Dr. Karl Behler
CODAC & IT services ASDEX Upgrade
phon +49 89 3299-1351 fax 3299-961351