Further searching led us to Solaris 10 source code from 2005 (?) in which the network stack matches the stack trace from our dtrace script. With the information found there we could examine more closely why the packets are dropped. It turns out that the enlarged buffer size set with our ndd command is not used for this particular socket. This implies that the UDP receive buffer size for the socket is set explicitly, probably by openAFS itself.

Searching the AFS source for setsockopt() calls that use SO_RCVBUF yields the following C code:

    len = rx_UdpBufSize;

    error = sockfs_sosetsockopt(so, SOL_SOCKET, SO_RCVBUF, &len,
                                sizeof(len));

Here rx_UdpBufSize turns out to be a constant defined in a global .h file, with the value 64*1024.
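To check what a socket actually ends up with after such an explicit request, one can read the option back. The following is only a user-space sketch of the same idea, not openAFS or kernel code; the function names set_and_read_rcvbuf and probe_udp_rcvbuf are made up for this illustration, and the kernel may round or cap the requested value.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Request a receive buffer of `requested` bytes on socket `fd`, then ask
 * the kernel which size is in effect. Returns that size, or -1 on error.
 * (Hypothetical helper, for illustration only.) */
int set_and_read_rcvbuf(int fd, int requested)
{
    int effective = 0;
    socklen_t optlen = sizeof(effective);

    assert(requested > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) != 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &optlen) != 0)
        return -1;
    return effective;
}

/* Open a throwaway UDP socket, apply the buffer size, and return the
 * size reported by the kernel (or -1 on error). */
int probe_udp_rcvbuf(int requested)
{
    int effective;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    effective = set_and_read_rcvbuf(fd, requested);
    close(fd);
    return effective;
}
```

Calling probe_udp_rcvbuf(64 * 1024) and printing the result shows the per-socket value that applies once SO_RCVBUF has been set explicitly, independent of the system-wide default.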

We are presently modifying this constant and testing whether that helps.



On 19.04.17 19:26, Karl Behler wrote:

Dear All,

Today I posted the following question in the Oracle Solaris Community network.
However, I would also like to hear your opinions or hints about it.

In fact, under high load (30+ users logged in to one RayServer) we occasionally experience hangs of the users' desktops lasting from several seconds up to minutes. This is not a continuously growing degradation, however, but more like hitting a wall (nothing works) and then recovering as if nothing had happened. The frequency of these events goes up over the week, and typically in the afternoons systems may occasionally come to a complete and sometimes unrecoverable halt (requiring a hard reboot).

We are using openAFS as the network filesystem for home directories and data files on our SunRay servers. As the number of user sessions (20-40) and the activity (5-10% CPU usage) rise on these SunRay servers, we observe alarming values from the command "netstat -s".

IPv4    ipForwarding        =     2     ipDefaultTTL =   255
        ipInReceives        =123501543  ipInHdrErrors =     0
        ipInAddrErrors      =     0     ipInCksumErrs =     0
        ipForwDatagrams     =     0     ipForwProhibits =    19
        ipInUnknownProtos   =  3361     ipInDiscards =     6
        ipInDelivers        =123975114  ipOutRequests =114328336
...
        tcpInErrs           =     0     udpNoPorts =  3248
        udpInCksumErrs      =     0     udpInOverflows =  1848
...

As the protocol used between the openAFS file servers and the openAFS clients on the SunRay servers is UDP based, we used rxdebug for further insight. We are seeing resends and waiting calls.

Using dtrace to analyze which packets are dropped, one sees clearly (and exclusively) packets sent from the openAFS file servers x.x.[100.59,100.67,30.41] to the SunRay server client at x.x.100.129:

    13   5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
    13   5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
    13   5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
    17   5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.67:7000 dst: x.x.100.129:7001 count: 1
    13   5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
    13   5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
    35   5093 ip_udp_input:udpIfStatsInOverflows src: x.x. 30.41:7000 dst: x.x.100.129:7001 count: 1
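To see which file server accounts for most of the drops, records like the ones above can be tallied per source address. The following is a minimal sketch under stated assumptions: the names parse_src, tally_add, tally_line and tally_get are invented for this example, each record is counted as one drop (as in the sample, where every record carries "count: 1"), and the "src: " field is taken to end at " dst:" or at the end of the line.

```c
#include <assert.h>
#include <string.h>

#define MAX_SOURCES 16
#define SRC_LEN     64

struct src_tally { char src[SRC_LEN]; long drops; };
struct tally     { struct src_tally entry[MAX_SOURCES]; int used; };

/* Copy the text after "src: " (up to " dst:" or end of line) into out.
 * Returns 1 on success, 0 if the line has no "src: " field. */
int parse_src(const char *line, char *out, size_t outlen)
{
    const char *start = strstr(line, "src: ");
    const char *end;
    size_t n;

    assert(outlen > 0);
    if (start == NULL)
        return 0;
    start += strlen("src: ");
    end = strstr(start, " dst:");
    if (end == NULL)
        end = start + strcspn(start, "\r\n");
    n = (size_t)(end - start);
    if (n >= outlen)
        n = outlen - 1;
    while (n > 0 && start[n - 1] == ' ')   /* drop trailing blanks */
        n--;
    memcpy(out, start, n);
    out[n] = '\0';
    return 1;
}

/* Add `count` drops for `src`. Returns the new total for that source,
 * or -1 if the table is full. */
long tally_add(struct tally *t, const char *src, long count)
{
    int i;

    for (i = 0; i < t->used; i++) {
        if (strcmp(t->entry[i].src, src) == 0) {
            t->entry[i].drops += count;
            return t->entry[i].drops;
        }
    }
    if (t->used == MAX_SOURCES)
        return -1;
    strncpy(t->entry[t->used].src, src, SRC_LEN - 1);
    t->entry[t->used].src[SRC_LEN - 1] = '\0';
    t->entry[t->used].drops = count;
    t->used++;
    return count;
}

/* Feed one output line; lines without a "src: " field are ignored. */
void tally_line(struct tally *t, const char *line)
{
    char src[SRC_LEN];

    if (parse_src(line, src, sizeof(src)))
        tally_add(t, src, 1);
}

/* Return the number of drops recorded for `src` (0 if never seen). */
long tally_get(const struct tally *t, const char *src)
{
    int i;

    for (i = 0; i < t->used; i++)
        if (strcmp(t->entry[i].src, src) == 0)
            return t->entry[i].drops;
    return 0;
}
```

A caller would zero a struct tally with memset, pass each line of the dtrace output to tally_line, and then query individual sources with tally_get. Continuation lines that hold only the "dst:" part are skipped automatically.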

According to the knowledge base document "netstat -s : Information and notes (Doc ID 1010792.1 <https://support.oracle.com/rs?type=doc&id=1010792.1>)", increasing the udp_recv_hiwat parameter is recommended.

However, using "ndd -set /dev/udp udp_recv_hiwat 134217728" (128 MB) does not reduce the udpInOverflows.

Further research using dtrace provides evidence that the UDP packets are discarded early in the IP stack.

dtrace -n  'mib:::udpIfStatsInOverflows{stack();}'

12   4648 ip_udp_input:udpIfStatsInOverflows
              ip`ip_input+0xcb2
              dls`soft_ring_drain+0x93
              dls`soft_ring_worker+0xdb
              unix`thread_start+0x8

However, this is where our debugging options, and our insight into what is going on, come to an end.

This all happens on an Oracle Blade X4-2B (blade center 6000) under SunOS 5.10 Generic_150401-48 with SunRay server software 4.5.4_34,REV=2015.04.14.10.39. Two 10Gb network interfaces are used.

Any hint on how to proceed with analyzing the situation (e.g. understanding the stack) or on further system tuning parameters to try would be very much appreciated.

I'm including the netstat document from the Oracle knowledge base since it cannot be found by a web search.

Best regards,

Karl


--
Dr. Karl Behler 
CODAC & IT services ASDEX Upgrade
phon +49 89 3299-1351 fax 3299-961351



