On Fri, 18 May 2007, Miles O'Neal wrote:

We're getting "do_ypcall: clnt_call: RPC: Timed out"
errors.

We're in the process of upgrading to 4.4,
starting with some new 64 bit Supermicros,
some with a single Xeon dual core and some with
a single Core 2 Duo.  Both have Intel e1000
ethernet chipsets.

We use NIS for user passwd and group entries,
as well as netgroups, services and automounts.
This has worked for us on 32 bit systems from
Redhat5.2 up through SL30{4,7} (including some
64 bit Athlons running a 32 bit OS).  We can
reproduce this on the 32 bit SL3 systems, but
they're a lot slower, and it takes some effort
to do it.

We first saw problems with torque (we've used
PBS Pro in the past), but narrowed it down to
rsh (and even a bare bones program running
rcmd()).  A single, random rsh call is fairly
safe, but if we do one every second or two,
we quickly start getting hangs and the error:

  do_ypcall: clnt_call: RPC: Timed out

The glibc code for doing NIS calls will retry about 4 times (well, it did last time I checked), and if the server hasn't answered by then it errors out.
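A rough model of that behaviour (assuming the ~4-try figure above; ypcall_model is a hypothetical stand-in, with the command it runs playing the part of a single clnt_call round trip):

```shell
#!/bin/sh
# Sketch only: models "retry ~4 times, then error" as described above.
# The command passed in stands for one RPC round trip: "true" means the
# server answered, "false" means the request was lost (e.g. eaten by a
# broken BMC on an IPMI port).
ypcall_model() {
    tries=0
    while [ "$tries" -lt 4 ]; do
        if "$@"; then
            echo "answered"
            return 0
        fi
        tries=$((tries + 1))
    done
    echo "do_ypcall: clnt_call: RPC: Timed out"
    return 1
}

ypcall_model true        # server answers on the first try
ypcall_model false || :  # every try lost -> the familiar error
```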

If you manage to send sufficiently many requests to the server that *it*
can't cope, then you will see these messages. Some ypserv implementations cope better with load than others...

Now glibc sends the yp requests from a privileged port and lets the system pick which one, so it ends up cycling through the available range.

Now, we have some servers with Intel mboards whose braindead BMC chipsets eat all traffic to the IPMI ports. Whenever anything happened to pick those ports it never got an answer, so it would time out. We saw *lots* of this, especially when doing things which caused lots of yp requests -- until we tracked it down and arranged for things to avoid the IPMI ports.

Can you do a sanity check and see if there is any correlation between the errors and the ports in use at the time? In our case tcpdump would show a packet being sent but no reply, and it was pretty obvious from those logs that anything using ports 623 and 664 (tcp and udp) was broken...
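For reference, a sketch of that check. The tcpdump line is the real diagnostic (it needs root and a live interface, so it is shown commented out); is_ipmi_port is just a hypothetical helper encoding the two ports our BMCs ate:

```shell
#!/bin/sh
# Sketch: flag the RMCP/IPMI ports (623 and 664) that a broken BMC can
# silently swallow, so they stand out when reading capture logs.
is_ipmi_port() {
    case "$1" in
        623|664) echo "eaten by BMC" ;;
        *)       echo "ok" ;;
    esac
}

# While reproducing the hangs, capture the reserved-port range and look
# for requests that never get a reply, e.g.:
#   tcpdump -n -i eth0 udp and portrange 512-1023
# then check the source port of each unanswered packet:
is_ipmi_port 623
is_ipmi_port 1021
```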

So it can happen at any time, but when we fire
off lots of jobs in quick succession via torque,
it's guaranteed to happen.  We have also seen
this with less frequency in some home grown tools.

We've stripped down NIS to bare essentials (using
only netgroup for testing), we've tried adding in
a 3Com ethernet card to use instead of the built-in
cards, we've upgraded to the latest EL4 ypbind,
ypserv and glibc (which we found in a CERN repo
after looking through TUV's bug list), we've tried
adding more, faster NIS servers, and we've tried
isolating three machines on a 100Mb network (no
spare 1Gb switches).  And tried running the non-SMP
kernel.  No difference.

I assume that you also checked for firewall issues at both ends...

Bizarrely, we also get whining in the SL3 ypservers'
message logs about failed NIS host lookups.  We don't
use NIS for host lookups; nsswitch.conf has:

  hosts:   files dns

We had only used Solaris servers in the past,
and their ypservs were not logging these errors.
Presumably they still got the requests, but we
don't know that.

Do you have any libc5 code perhaps?

We ran ypserv in debug mode for a while, and nothing
jumped out at us.

We started running nscd for passwd and group on all
the Linux systems after this started.  No change.

The switches are Cisco Gb switches and HP ProCurve
Gb switches (the isolated test network was a 3Com
100Mb switch).

Any ideas on either problem?

Thanks,
Miles

TEST SCRIPT (works every time with failure in less than
10 rsh calls on our faster boxes on the Gb network):

        #!/bin/csh

        # set LIST_OF_HOSTNAMES to a valid list of hosts
        # to try, the more the merrier.  We use a command
        # to generate these from a file of valid names.

        while ( 1 )
                foreach i ( $LIST_OF_HOSTNAMES )
                        rsh $i uname -a # or any command you like
                end
        end

Do you also see it with ssh connections? I ask 'cos rsh also picks a privileged (tcp) port...
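The distinction, for anyone checking: rsh binds its client side to a reserved port (below 1024, via rresvport()), while ssh uses an ordinary ephemeral port, which on Linux is normally well above 1024 -- so only the rsh-style connections can land on 623/664. A hypothetical classifier:

```shell
#!/bin/sh
# Sketch: classify a client-side port the way rsh vs ssh would pick it.
# rsh uses rresvport() and so binds below 1024; ssh gets an ordinary
# ephemeral port, normally well above 1024 on Linux.
port_class() {
    if [ "$1" -lt 1024 ]; then
        echo "reserved (rsh-style, could hit the IPMI ports)"
    else
        echo "ephemeral (ssh-style)"
    fi
}

# To see the actual client ports while the test script runs, something
# like this (514 is the rsh/shell service port):
#   netstat -tn | grep ':514 '
port_class 623
port_class 40022
```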

--
Jon Peatfield,  Computer Officer,  DAMTP,  University of Cambridge
Mail:  [EMAIL PROTECTED]     Web:  http://www.damtp.cam.ac.uk/