[OpenAFS] Re: undpInOverflows on heavily loaded AFS clients (Solaris 10 X86, SunRay server)

Karl Behler Thu, 20 Apr 2017 10:27:11 -0700

Deeper searching brought us to a Solaris 10 source from 2005 (?) where the network stack looked like the stack trace from our dtrace script. With the information found there it was possible to further examine the reason for dropping the packets: it turns out that the extended buffer length from our ndd command is not used for this particular socket connection. This implies that the udpBufSize for the socket is set explicitly, probably by OpenAFS itself.

Examining the AFS source for occurrences of SO_RCVBUF in setsockopt() calls yields the following C code:


    len = rx_UdpBufSize;

    error = sockfs_sosetsockopt(so, SOL_SOCKET, SO_RCVBUF, &len,
                                sizeof(len));

Here rx_UdpBufSize turns out to be a constant defined in a global.h-file, with the value 64*1024 (65536 bytes).
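
For anyone who wants to check this behaviour from user space, here is a minimal sketch. It is not the OpenAFS code (which calls sockfs_sosetsockopt() inside the kernel); it assumes an ordinary POSIX UDP socket, and the helper names get_udp_rcvbuf and set_udp_rcvbuf are made up for illustration. It reads the default receive buffer size of a socket, then sets SO_RCVBUF explicitly and reads it back. Once a socket sets the option itself, the system-wide default should no longer determine the size for that socket, which would be consistent with what we see.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/*
 * Return the receive buffer size the kernel currently reports for
 * the socket, or -1 on error. On a fresh socket this is the
 * system default.
 */
int get_udp_rcvbuf(int fd)
{
    int size = 0;
    socklen_t optlen = sizeof(size);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &optlen) != 0)
        return -1;
    return size;
}

/*
 * Explicitly request a receive buffer of `requested` bytes and
 * return the size the kernel reports afterwards, or -1 on error.
 * The reported value may differ from the request, because the OS
 * may clamp it to a system maximum or round it.
 */
int set_udp_rcvbuf(int fd, int requested)
{
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) != 0)
        return -1;
    return get_udp_rcvbuf(fd);
}
```

Calling get_udp_rcvbuf() on a new socket(AF_INET, SOCK_DGRAM, 0) before and after set_udp_rcvbuf(fd, 64 * 1024) shows whether the default or the explicit value is in effect for that socket.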

We are presently working on modifying this constant and testing whether this helps.




On 19.04.17 19:26, Karl Behler wrote:

Dear All,

Today I posted the following question in the Oracle Solaris Community network.
However, I'd also like to hear your opinions or hints about it.
In fact, under high load (30+ users logged into one SunRay server) we are occasionally experiencing hangs of the users' desktops lasting from several seconds up to minutes. However, this is not a continuously growing degradation but more like hitting a wall (nothing goes) and then falling back as if nothing had happened. But the frequency of these events goes up over the week, and typically in the afternoons systems may occasionally come to a complete and sometimes unrecoverable halt (needing a hard reboot).
We are using OpenAFS as the network filesystem for home directories and data files on our SunRay servers. As the number of user sessions (20-40) and the activity (5-10% CPU usage) rise on these SunRay servers, we observe alarming values from the command "netstat -s".
IPv4    ipForwarding        =     2     ipDefaultTTL =   255
        ipInReceives        =123501543  ipInHdrErrors =     0
        ipInAddrErrors      =     0     ipInCksumErrs =     0
        ipForwDatagrams     =     0     ipForwProhibits =    19
        ipInUnknownProtos   =  3361     ipInDiscards =     6
        ipInDelivers        =123975114  ipOutRequests =114328336
...
        tcpInErrs           =     0     udpNoPorts =  3248
        udpInCksumErrs      =     0     udpInOverflows =  1848
...
As the protocol used between the OpenAFS file servers and the OpenAFS clients on the SunRay servers is UDP based, we used rxdebug for further insight. We are seeing resends and waiting calls.
Using dtrace to analyze what kind of packets are dropped, one clearly (and only) sees packets sent from the OpenAFS fileservers x.x.[100.59,100.67,30.41] to the SunRay server client on x.x.100.129:
    13   5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
    13   5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
    13   5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
    17   5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.67:7000 dst: x.x.100.129:7001 count: 1
    13   5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
    13   5093 ip_udp_input:udpIfStatsInOverflows src: x.x.100.59:7000 dst: x.x.100.129:7001 count: 1
    35   5093 ip_udp_input:udpIfStatsInOverflows src: x.x. 30.41:7000 dst: x.x.100.129:7001 count: 1
According to the knowledge base document "netstat -s : Information and notes (Doc ID 1010792.1 <https://support.oracle.com/rs?type=doc&id=1010792.1>)" it is recommended to increase the udp_recv_hiwat parameter.
However, using "ndd -set /dev/udp udp_recv_hiwat 134217728" (128 MB) does not reduce the udpInOverflows.
Further research using dtrace delivers evidence that UDP packets are discarded early in the IP stack.
dtrace -n  'mib:::udpIfStatsInOverflows{stack();}'

12   4648 ip_udp_input:udpIfStatsInOverflows
              ip`ip_input+0xcb2
              dls`soft_ring_drain+0x93
              dls`soft_ring_worker+0xdb
              unix`thread_start+0x8
However, this is where our debugging possibilities, and our insight into what is going on, end.
This all happens on an Oracle Blade X4-2B (blade center 6000) under SunOS 5.10 Generic_150401-48 with SunRay server software 4.5.4_34,REV=2015.04.14.10.39. Two 10Gb network interfaces are used.
Any hint on how one could proceed in analyzing the situation (e.g. understanding the stack) or on further system tuning parameters to try is very much appreciated.
I'm including the netstat document from the Oracle knowledge base since it cannot be found by a web search.
Best regards,

Karl


--
Dr. Karl Behler 
CODAC & IT services ASDEX Upgrade
phon +49 89 3299-1351 fax 3299-961351



