First, upgrade your fileserver to an actual "production" release, such as 1.4.1. 1.3.81 was pretty good, but not without problems. (1.4.1 is not without problems either, but it has fewer.)

We are considering that as a last resort, but we run tens of Linux (Debian/stable) servers (not only AFS) as part of our distributed computing environment, and we try to keep our server configuration as close as possible to the stable distribution. In short: we have had no significant AFS problems with this same configuration for more than a year...

Going with a "random Linux distro's" idea of stable for your AFS code is not a good idea. Stick with OpenAFS's idea of stable. For short periods I've run "development" (e.g. late 1.3.*) code on my production AFS servers when I was in a pinch, but otherwise stick to the production releases. Ignore what Debian thinks, because they don't know what they're talking about ;)

Second, when your server goes into this state, does it come out of it naturally or do you have to restart it?

Actually, this state can "freeze" many of our users and services (even if the affected server serves RO replicas only... and yes, I really don't understand this behavior...) and the FS is unable to return to its "normal" state in a reasonable time (and a reasonable time is pretty short for us and our users...). So we are trying to "solve" our current problems with an fs restart. :-(

(
As you can see from the original post, the FS is still alive but has no idle threads. Waiting connections (clients) oscillate around 200 and could "probably" be served in tens of minutes...
)
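
The idle-thread and waiting-call figures above are the kind of thing rxdebug reports for the fileserver port, so they can be polled from a script. What follows is only a minimal sketch under stated assumptions: it assumes the rxdebug path used in the script later in this message, the -noconns flag, port 7000 for the fileserver, and output lines of the form "N calls waiting for a thread" and "N threads are idle". The helper name parse_rxdebug and the sample text are hypothetical, made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull the two thread counters out of rxdebug text.
# Returns (waiting, idle); either is undef if its line was not seen.
sub parse_rxdebug {
    my ($text) = @_;
    my ($waiting) = $text =~ /^(\d+) calls waiting for a thread/m;
    my ($idle)    = $text =~ /^(\d+) threads are idle/m;
    return ($waiting, $idle);
}

# Hypothetical sample in the ASSUMED output format; used when no server is named.
my $sample = <<'EOT';
Trying 10.0.0.1 (port 7000):
not waiting for packets.
200 calls waiting for a thread
0 threads are idle
Done.
EOT

my $server = shift @ARGV;
my $text   = $sample;
if (defined $server) {
    # Assumed path, port and flag; the list form of open avoids the shell.
    open(my $fh, '-|', '/usr/afsws/etc/rxdebug', $server, '7000', '-noconns')
        or die "cannot exec rxdebug: $!\n";
    local $/;                       # slurp the whole output at once
    $text = <$fh>;
    $text = '' unless defined $text;
    close $fh;
}

my ($waiting, $idle) = parse_rxdebug($text);
printf "waiting=%s idle=%s\n",
    defined $waiting ? $waiting : "?",
    defined $idle    ? $idle    : "?";
```

Run it with a fileserver name as the only argument; with no argument it just parses the built-in sample. Logging these two numbers every minute or so would show whether the server ever drains its queue on its own.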

You could have the "horrible" host callback table mutex lockup problem. The surest way to find out is to generate a core from your running fileserver while it is in this state (on Solaris I use gcore, but you could also kill -SEGV it instead of restarting), attach a debugger to the core, and see where the threads are sitting. If you've compiled your OpenAFS distribution with --enable-debug (which you should), and you examine the stack traces of some of the threads, you may see a lot of them here:

=>[5] CallPreamble(acall = ???, activecall = ???, tconn = ???, ahostp = ???) (optimized), at 0x8082178 (line ~315) in "afsfileprocs.c"
(dbx) list
  315       H_LOCK;
  316     retry:
  317       tclient = h_FindClient_r(*tconn);
  318       thost = tclient->host;
  319       if (tclient->prfail == 1) { /* couldn't get the CPS */
...
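
With a few hundred threads, eyeballing every stack gets tedious, so it can help to save the debugger's backtraces to a file and tally which functions show up. The following is a rough sketch, not a tested tool: it assumes frames look either like the dbx line above ("=>[5] CallPreamble(...") or like gdb's "#5  0x... in CallPreamble (...", and the helper name tally_frames is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many stack frames name each function in a saved backtrace.
# ASSUMED frame formats:
#   dbx:  =>[5] CallPreamble(acall = ???, ...
#   gdb:  #5  0x08082178 in CallPreamble (acall=...
sub tally_frames {
    my (@lines) = @_;
    my %count;
    foreach my $line (@lines) {
        # Try the dbx form first, then fall back to the gdb form.
        my ($func) = $line =~ /^\s*(?:=>)?\s*\[\d+\]\s+(\w+)\(/;
        ($func) = $line =~ /^#\d+\s+(?:0x[0-9a-fA-F]+ in )?(\w+) \(/
            unless defined $func;
        $count{$func}++ if defined $func;
    }
    return %count;
}

# Only read input when backtrace files are named on the command line.
if (@ARGV) {
    my %count = tally_frames(<>);
    foreach my $func ( sort { $count{$b} <=> $count{$a} || $a cmp $b }
                       keys %count ) {
        printf "%5d %s\n", $count{$func}, $func;
    }
}
```

Feed it the saved output of the debugger's all-threads backtrace command (in gdb that would be "thread apply all bt"; the file name is up to you). If CallPreamble is near the top of the list with a count close to the number of fileserver threads, you are probably looking at the lockup described here.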

If this is the case...well...there's no for-sure way around it right now, though some people, IIRC, have been working on code changes to avoid it. There are steps you can take to mitigate the problem, though: make sure all your clients respond promptly on their AFS callback ports (7001/udp). With all of the packet manglers out on the network (host-based firewalls, overanxious network administrators, etc.) you may find things "in the way" of the AFS fileservers contacting their clients on the callback port. One of the things that can cause this type of "lockup" is requests to these clients timing out or taking a long time... If things have been working fine for a while and now they don't, network topology/firewall changes like this could be the culprit.

I've attached a script that I periodically run to see how many "bad" clients are using my fileservers, so that I may try to track them down and swat at them...

-----

#!/usr/local/bin/perl

$| = 1;

sub getclients {
        my $server = shift @_;

        my %ips;

        print STDERR "getting connections for $server\n";

        open(RXDEBUG, "/usr/afsws/etc/rxdebug -allconnections $server|")
                || die "cannot exec rxdebug\n";

        while(<RXDEBUG>) {

                if ( /Connection from host ([^, ]+)/ ) {
                        my $ip = $1;
                        if ( ! defined($ips{$ip}) ) {
                                $ips{$ip} = $ip;
                        }
                }
        }

        close RXDEBUG;

        return keys(%ips);
}

sub checkcmdebug {
        my $client = shift @_;

        print STDERR "checking $client\n";

        open(CMDEBUG, "/usr/afsws/bin/cmdebug -cache $client 2>&1|")
                || die "cannot exec cmdebug\n";

        while(<CMDEBUG>) {
                if ( /server or network not responding/ ) {
                        close CMDEBUG;  # don't leave the pipe open on early return
                        return 0;
                }
        }
        close CMDEBUG;
        return 1;
}

my %clients;

# modify this to run getclients on all of your AFS servers...

foreach my $y ( "ifs1", "ifs2", "hfs1", "hfs2", "bfs1", "hfs11", "hfs12" ) {
        foreach my $x ( &getclients($y.".afs.umbc.edu") ) {
                $clients{$x}++;
        }
}


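use Socket;

# print each client that does not answer cmdebug, with its reverse DNS name
foreach my $x ( keys(%clients) ) {
        if ( ! &checkcmdebug($x) ) {
                print "$x";
                my $iaddr = inet_aton($x);
                my $name = defined($iaddr) ? gethostbyaddr($iaddr, AF_INET) : undef;
                $name = "unknown" unless defined($name);  # no PTR record
                print "($name)\n";
        }
}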