Re: [OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

Robert Banz Fri, 06 Oct 2006 08:32:10 -0700

First, upgrade your fileserver to an actual "production" release, such as 1.4.1. 1.3.81 was pretty good, but not without problems. (1.4.1 is not without problems either, but it has fewer.)
We are keeping that as a last resort, but we are running tens of Linux (Debian/stable) servers (not only AFS) as part of our distributed computing environment, and we are trying to keep our server configuration as close as possible to the stable dist. In short: we haven't had any significant AFS problems with the same configuration for 1+ years...

Keeping with "random linux distro's" idea of stable for your AFS code is not a good idea. Stick with OpenAFS's idea of stable. While I've run "development" (e.g. late 1.3.*) code on my production AFS servers for short periods when I was in a pinch, stick to the production releases. Ignore what Debian thinks, because they don't know what they're talking about ;)

Second, when your server goes into this state, does it come out of it naturally or do you have to restart it?
Actually, this state can "freeze" many of our users and services (even if the affected server serves RO replicas only... and yes, I really don't understand this behavior...) and the FS is unable to return to a "normal" state in a reasonable time (actually, a reasonable time is pretty short for us/our users...). So, we are trying to "solve" our current problems with an fs restart. :-(
(As you can see from the original post, the FS is still alive but has no idle threads. Waiting connections (clients) oscillate around 200 and "probably" could be served in tens of minutes...)

You could have the "horrible" host callback table mutex lockup problem. The most certain way to discover this is to generate a core from your running fileserver at the time (on Solaris I use gcore, but you could also kill -SEGV it instead of restarting), attach a debugger to the core, and see where the threads are sitting. If you've compiled your OpenAFS distribution with --enable-debug (which you should), and you examine the stack trace of some of the threads, you may see a lot of them here:

=>[5] CallPreamble(acall = ???, activecall = ???, tconn = ???, ahostp = ???) (optimized), at 0x8082178 (line ~315) in "afsfileprocs.c"

(dbx) list
  315       H_LOCK;
  316     retry:
  317       tclient = h_FindClient_r(*tconn);
  318       thost = tclient->host;
  319       if (tclient->prfail == 1) { /* couldn't get the CPS */
...
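
On Linux, roughly the same inspection can be sketched with gcore and gdb instead of dbx. This is only an illustration: the function name, the /var/tmp core prefix, and the fileserver binary path passed in by the caller are assumptions, not part of any OpenAFS tooling, and the count is only meaningful if the binary has symbols.

```shell
#!/bin/sh
# Sketch only: the function name and the /var/tmp prefix are illustrative
# assumptions; pass in the PID and binary path that apply to your server.

# Usage: count_callpreamble_threads <fileserver-pid> <path-to-fileserver-binary>
count_callpreamble_threads() {
    pid="$1"
    bin="$2"
    prefix="/var/tmp/fileserver.core"

    # gcore writes the core to "<prefix>.<pid>" and leaves the process running.
    gcore -o "$prefix" "$pid" >/dev/null || return 1

    # Print a backtrace for every thread, then count the lines that
    # mention CallPreamble (roughly one per thread sitting there).
    gdb -batch -ex 'thread apply all bt' "$bin" "$prefix.$pid" 2>/dev/null |
        grep -c 'CallPreamble'
}
```

A count close to the total number of fileserver threads would match the symptom described above; a count near zero suggests looking elsewhere.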

If this is the case... well... there's no for-sure way around it right now, though some people, IIRC, have been working on some code changes to avoid it. Some steps you can take to mitigate the problem, though, involve making sure all your clients respond promptly on their AFS callback ports (7001/udp). With all of the packet manglers out on the network (host-based firewalls, overanxious network administrators, etc.) you may find things "in the way" of the AFS fileservers contacting their clients on the callback port. One of the things that can cause this type of "lockup" is requests to these clients timing out or taking a long time... If things have been working fine for "a while" and now they don't, network topology/firewall changes like this could be a culprit.

I've attached a script that I periodically run to see how many "bad" clients are using my fileservers, so that I may try to track them down and swat at them...


-----

#!/usr/local/bin/perl

$| = 1;

sub getclients {
        my $server = shift @_;

        my %ips;

        print STDERR "getting connections for $server\n";

        open(RXDEBUG, "/usr/afsws/etc/rxdebug -allconnections $server|")
                || die "cannot exec rxdebug\n";

        while(<RXDEBUG>) {

                if ( /Connection from host ([^, ]+)/ ) {
                        my $ip = $1;
                        if ( ! defined($ips{$ip}) ) {
                                $ips{$ip} = $ip;
                        }
                }
        }

        close RXDEBUG;

        return keys(%ips);
}

sub checkcmdebug {
        my $client = shift @_;

        print STDERR "checking $client\n";

        open(CMDEBUG, "/usr/afsws/bin/cmdebug -cache $client 2>&1|")
                || die "cannot exec cmdebug\n";


        while(<CMDEBUG>) {
                if ( /server or network not responding/ ) {
                        return 0;
                }
        }
        close CMDEBUG;
        return 1;
}

my %clients;

# modify this to run getclients on all of your AFS servers...

foreach my $y ( "ifs1", "ifs2", "hfs1", "hfs2", "bfs1", "hfs11", "hfs12" ) {

        foreach my $x ( &getclients($y.".afs.umbc.edu") ) {
                $clients{$x}++;
        }
}


use Socket;

foreach my $x ( keys(%clients) ) {
        if ( ! &checkcmdebug($x) ) {
                print "$x";
                use Socket;
                my $iaddr = inet_aton($x);
                my $name = gethostbyaddr($iaddr, AF_INET);
                print "($name)\n";
        }
}
