Re: [OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

2006-11-05 Thread Michal Svamberg

Hi,
Thanks for the link. The problem is that the clients have the same UUID
because they have the same SID. The problem can be seen in hosts.dump
(kill -XCPU pid_of_fileserver), near the lines containing the string
'lock:', for example:
---cut---
ip:360de493 port:7001 hidx:251 cbid:16297 lock: last:1159945605 active:1159940686 down:0 del:0 cons:0 cldel:32
hpfailed:0 hcpsCall:1159943657 hcps [ -211] [ 330de493 3a0de493 370de493 360de493 430de493 440de493 3e0de493 420de493 3d0de493 470de493 320de493 480de493 490de493 450de493 340de493 3f0de493 350de493 410de493 400de493 3c0de493] holds: 3bf69 slot/bit: 0/1
---cut---

The IP addresses of the misconfigured clients appear on the line with
'hpfailed'. After reconfiguring all of the affected stations, the
meltdown no longer occurs.
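To make the diagnosis concrete, here is a small sketch that counts how many addresses hang off a single host entry; a long list under one entry is the signature of many clients sharing one UUID. (The entry format is assumed purely from the excerpt quoted above, not from OpenAFS source.)

```python
import re

# One 'hpfailed' line from a hosts.dump entry, shortened from the
# excerpt above (format assumed from that excerpt only).
entry = (
    "hpfailed:0 hcpsCall:1159943657 hcps [ -211] "
    "[ 330de493 3a0de493 370de493 360de493 430de493 440de493 ] "
    "holds: 3bf69 slot/bit: 0/1"
)

# The addresses are the 8-digit hex words inside the second bracketed
# list; counting them shows how many interfaces share this host entry.
addr_list = entry.split("[")[2]
addrs = re.findall(r"\b[0-9a-f]{8}\b", addr_list)
print(len(addrs), "addresses in one host entry")
```

In the full dump above there are twenty addresses under the one entry, one per misconfigured station.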

A question about this problem: would you consider a new option limiting
the maximum number of clients with the same UUID that can connect to a
fileserver? Or writing a warning message to FileLog (without debug
enabled)? In my opinion it is not good that clients are able to shut
down a server.

Thanks for answer,
Michal Svamberg.

On 10/10/06, Derrick J Brashear [EMAIL PROTECTED] wrote:

On Tue, 10 Oct 2006, Michal Svamberg wrote:

 We upgraded the file servers to 1.4.1 (built 2006-05-05), but it did not solve the meltdown.

get a backtrace when the fileserver is not responding.

on a whim, you might also try this patch:
http://grand.central.org/rt/Ticket/Display.html?id=19461

___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info




Re: [OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

2006-11-05 Thread Derrick J Brashear

On Sun, 5 Nov 2006, Michal Svamberg wrote:


Hi,
Thanks for the link. The problem is that the clients have the same UUID
because they have the same SID. The problem can be seen in hosts.dump


Ugh.


A question about this problem: would you consider a new option limiting
the maximum number of clients with the same UUID that can connect to a
fileserver? Or writing a warning message to FileLog (without debug
enabled)? In my opinion it is not good that clients are able to shut
down a server.


Ours either. Did you try that patch?



Re: [OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

2006-11-05 Thread Jeffrey Altman
Michal Svamberg wrote:
 Hi,
 Thanks for the link. The problem is that the clients have the same UUID
 because they have the same SID.

Are you saying that these are Windows machines and that (a) you cloned
the machines and did not delete the AFSCache or (b) are running cloned
machines with OAFW clients that are older than 1.4.0?

If so, delete the AFSCache files and you will get a new UUID; or
upgrade the clients and they will auto-detect the cloning and produce a
new UUID.

Jeffrey Altman





Re: [OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

2006-11-05 Thread Jeffrey Hutzelman
On Sun, 5 Nov 2006, Michal Svamberg wrote:

 Thanks for the link. The problem is that the clients have the same UUID

Don't do this.  UUID stands for Universally Unique IDentifier; each
client _MUST_ have a different UUID.  If two or more clients have the same
UUID, then the fileserver thinks that they are the same client, and you
will have all sorts of problems.  This is the moral equivalent of trying
to have two or more machines on your network with the same IP address --
it will not work.
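The uniqueness requirement is exactly what standard UUID generators are built to provide. A trivial illustration (plain Python, nothing AFS-specific):

```python
import uuid

# Freshly generated (version 4, random) UUIDs do not collide in
# practice; an AFS client is expected to hold one such identifier
# per machine, never shared between machines.
ids = {uuid.uuid4() for _ in range(1000)}
print(len(ids))  # 1000 distinct identifiers
```

Cloned machine images defeat this only because the generated value is copied along with the disk image, which is why the fix is to regenerate it per machine.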

-- Jeffrey T. Hutzelman (N3NHS) [EMAIL PROTECTED]
   Sr. Research Systems Programmer
   School of Computer Science - Research Computing Facility
   Carnegie Mellon University - Pittsburgh, PA



Re: [OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

2006-11-05 Thread Jeffrey Altman
Michal Svamberg wrote:
 A question about this problem: would you consider a new option limiting
 the maximum number of clients with the same UUID that can connect to a
 fileserver? Or writing a warning message to FileLog (without debug
 enabled)? In my opinion it is not good that clients are able to shut
 down a server.

If you have 100 clients with the same UUID as far as the file server
is concerned, they are the same client that just happens to be
multi-homed with 100 address-port combinations.

The whole point of the UUID is to provide a unique identifier for
each client.  If you are distributing clients that are all manually
configured to use the same UUID, that is an administrative problem.

The specific problem that you have run into is that your clients have
registered more callbacks via FetchStatus calls than can be maintained
by the file server in the callbacks table.  The file server is therefore
searching for a callback that can be deregistered.  It doesn't want to
deregister a callback from the same host that is attempting to register
a new one because doing so could produce a feedback loop.  Unfortunately
because all of your clients have the same UUID, there is only one host
entry and there are no other hosts from which callbacks can be
deregistered.  The answer is not to place a limit on the number of
address-port values that can be associated with a host.  Doing so would
adversely impact clients behind NATs or that migrate across networks
that dynamically assign IP addresses.
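The failure mode described above can be sketched with a toy model (hypothetical code, not the fileserver's actual callback logic): eviction looks for a callback owned by some *other* host, and with a single shared UUID there is only one host entry, so nothing is evictable.

```python
def find_victim(callback_owners, requesting_host):
    """Return the owner of a callback that may be dropped: any host
    other than the one currently registering. None if no such host."""
    for owner in callback_owners:
        if owner != requesting_host:
            return owner
    return None

# 100 clients with distinct UUIDs: plenty of eviction candidates.
distinct = ["uuid-%02d" % i for i in range(100)]
assert find_victim(distinct, "uuid-00") is not None

# 100 clients sharing one UUID look like a single multi-homed host:
# no other host entry exists, so no callback can be deregistered.
shared = ["uuid-00"] * 100
assert find_victim(shared, "uuid-00") is None
print("ok")
```

The same-host exclusion is deliberate (it prevents the feedback loop mentioned above), which is why the degenerate one-host case cannot be fixed server-side by tuning the search.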

The answer is to fix the clients.

Jeffrey Altman




Re: [OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

2006-10-10 Thread Michal Svamberg

We upgraded the file servers to 1.4.1 (built 2006-05-05), but it did not solve the meltdown.

The fileserver runs in large mode, and the meltdown behaves as follows:
- first 10 min: 12 threads are fully used, only 2 idle threads
(as reported by rxdebug)
- wprocs starts counting up from zero
- over the next ~10 min: up to 300 processes waiting for a thread
- the fileserver clears wprocs (sending VBUSY to clients?) but does not
free threads; wprocs starts counting from zero again
- after another 10 min the loop closes: wprocs counts up to 300 and is
cleared again
- at any point, restarting the fileserver (via bos) restores normal
operation

The meltdown affects:
- user servers (RW + backup volumes)
- software servers (RW + RO + backup volumes)
- replication servers (RO volumes)

The upgrade also did not answer the question about rx_ignoreAckedPacket:
which packets are counted as rx_ignoreAckedPacket?

I have tons of logs, but I don't know what to search for in them.

Do you have any ideas?

Thanks, Michal Svamberg.


Re: [OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

2006-10-10 Thread Derrick J Brashear

On Tue, 10 Oct 2006, Michal Svamberg wrote:


We upgraded the file servers to 1.4.1 (built 2006-05-05), but it did not solve the meltdown.


get a backtrace when the fileserver is not responding.

on a whim, you might also try this patch:
http://grand.central.org/rt/Ticket/Display.html?id=19461



[OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

2006-10-06 Thread Michal Svamberg

Hello,
I don't know what rx_ignoreAckedPacket is. I see thousands (up to 5) of
rx_ignoreAckedPacket per 15 seconds on the fileserver; the number of
calls is smaller (up to 1). Can rx_ignoreAckedPacket really be many
times the number of calls?

We have this infrastructure:
Fileservers (large mode): OpenAFS 1.3.81 built 2005-05-14 (debian/stable)

Windows and Linux clients from version 1.2 to 1.4, plus 1.5 for experimental use:
OpenAFS 1.2.10 built 2005-04-06
OpenAFS 1.3.82 built 2005-08-20
OpenAFS 1.4.2fc4 built 2006-10-02
OpenAFS 1.4.0101

Some of the fileservers sometimes go into the meltdown state (calls
waiting for a thread), and we don't know the reason. Here is
'rxdebug -rxstats' output:

Free packets: 935, packet reclaims: 1283, calls: 2197185, used FDs: 64
not waiting for packets.
201 calls waiting for a thread
2 threads are idle
rx stats: free packets 935, allocs 7046769, alloc-failures(rcv 0/0,send 0/0,ack 0)
  greedy 0, bogusReads 0 (last from host 0), noPackets 0, noBuffers 0, selects 0, sendSelects 0
  packets read: data 2220845 ack 3323232 busy 5 abort 5125 ackall 3 challenge 4 response 1098 debug 43944 params 0 unused 0 unused 0 unused 0 version 0
  other read counters: data 2220774, ack 3322547, dup 0 spurious 165 dally 5
  packets sent: data 2851035 ack 54295 busy 592 abort 72 ackall 0 challenge 1098 response 4 debug 0 params 0 unused 0 unused 0 unused 0 version 0
  other send counters: ack 54295, data 9546732 (not resends), resends 2908, pushed 0, ackedignored 3238665
   (these should be small) sendFailed 0, fatalErrors 0
  Average rtt is 0.006, with 745772 samples
  Minimum rtt is 0.000, maximum is 60.235
  518 server connections, 676 client connections, 706 peer structs, 350 call structs, 0 free call structs
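For what it's worth, the ratio in question can be read straight off the counters above; a quick back-of-the-envelope check using only the numbers quoted:

```python
# From the rxdebug output above: ignored-acked packets vs. total calls.
ackedignored = 3238665
calls = 2197185

ratio = ackedignored / calls
print(round(ratio, 2))  # roughly 1.5 ignored-acked packets per call
```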

Thanks for any answer.

Michal Svamberg


Re: [OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

2006-10-06 Thread Robert Banz


On Oct 6, 2006, at 04:52, Michal Svamberg wrote:


Hello,
I don't know what rx_ignoreAckedPacket is. I see thousands (up to 5) of
rx_ignoreAckedPacket per 15 seconds on the fileserver; the number of
calls is smaller (up to 1). Can rx_ignoreAckedPacket really be many
times the number of calls?



First, upgrade your fileserver to an actual production release, such as
1.4.1. 1.3.81 was pretty good, but not without problems. (1.4.1 is not
without problems, but it has fewer.)


Second, when your server goes into this state, does it come out of it
naturally or do you have to restart it?


-rob


Re: [OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

2006-10-06 Thread Derrick J Brashear

On Fri, 6 Oct 2006, Robert Banz wrote:



On Oct 6, 2006, at 04:52, Michal Svamberg wrote:


Hello,
I don't know what rx_ignoreAckedPacket is. I see thousands (up to 5) of
rx_ignoreAckedPacket per 15 seconds on the fileserver; the number of
calls is smaller (up to 1). Can rx_ignoreAckedPacket really be many
times the number of calls?



First, upgrade your fileserver to an actual production release, such as
1.4.1. 1.3.81 was pretty good, but not without problems. (1.4.1 is not
without problems, but it has fewer.)


Second, when your server goes into this state, does it come out of it
naturally or do you have to restart it?


Third, was there some previous version where this did not happen?

Derrick



Re: [OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

2006-10-06 Thread Lubos Kejzlar

Hi Robert,
and many thanks for your reply!

Let me answer on behalf of Michal - we are co-workers and Michal has
left on vacation for a couple of days ;-)


Robert Banz wrote:

First, upgrade your fileserver to an actual production release, such as
1.4.1. 1.3.81 was pretty good, but not without problems. (1.4.1 is not
without problems, but it has fewer.)


We are considering that as a (last) possibility, but we are running tens
of Linux (Debian/stable) servers (not only AFS) as part of our
distributed computing environment, and we try to keep our server
configuration as close as possible to the stable distribution. In short:
we have had no significant AFS problems with the same configuration for
over a year...


Second, when your server goes into this state, does it come out of it
naturally or do you have to restart it?


Actually, this state can freeze many of our users and services (even if
the affected server serves RO replicas only... and yes, I really don't
understand this behavior...), and the FS is unable to return to a normal
state in a reasonable time (and "reasonable" is pretty short for us and
our users...). So for now we work around the problem with an fs
restart. :-(


(As you can see from the original post, the FS is still alive but has no
idle threads. Waiting connections (clients) oscillate around 200 and
could probably be served within tens of minutes...)

Thanks again for your reply and suggestions!!


--
--
Lubos Kejzlar
Head of Laboratory for Computer Science

Center for Information Technology  Tel.:  +420-377 632 710
University of West Bohemia+420-724 094 277
Univerzitni 8, 306 14 Pilsen   Fax:   +420-377 632 702
Czech Republic E-mail:  [EMAIL PROTECTED]
--




Re: [OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

2006-10-06 Thread Robert Banz


First, upgrade your fileserver to an actual production release, such as
1.4.1. 1.3.81 was pretty good, but not without problems. (1.4.1 is not
without problems, but it has fewer.)


We are considering that as a (last) possibility, but we are running tens
of Linux (Debian/stable) servers (not only AFS) as part of our
distributed computing environment, and we try to keep our server
configuration as close as possible to the stable distribution. In short:
we have had no significant AFS problems with the same configuration for
over a year...


Keeping with a random Linux distro's idea of "stable" for your AFS code
is not a good idea. Stick with OpenAFS's idea of stable -- and while for
short periods I've run development (e.g. late 1.3.*) code on my
production AFS servers when I was in a pinch, stick to the production
releases. Ignore what Debian thinks, because they don't know what
they're talking about ;)


Second, when your server goes into this state, does it come out of it
naturally or do you have to restart it?


Actually, this state can freeze many of our users and services (even if
the affected server serves RO replicas only... and yes, I really don't
understand this behavior...), and the FS is unable to return to a normal
state in a reasonable time (and "reasonable" is pretty short for us and
our users...). So for now we work around the problem with an fs
restart. :-(


(As you can see from the original post, the FS is still alive but has no
idle threads. Waiting connections (clients) oscillate around 200 and
could probably be served within tens of minutes...)


You could have the horrible host/callback table mutex lockup problem.
The most for-certain way to discover this is to generate a core from
your running fileserver at the time (on Solaris I use gcore, but you
could also kill -SEGV it instead of restarting), attach a debugger to
the core, and see where the threads are sitting. If you've compiled your
OpenAFS distribution with --enable-debug (which you should), and you
examine the stack traces of some of the threads, you may see a lot of
them here:

=>[5] CallPreamble(acall = ???, activecall = ???, tconn = ???, ahostp = ???) (optimized), at 0x8082178 (line ~315) in afsfileprocs.c

(dbx) list
  315   H_LOCK;
  316 retry:
  317   tclient = h_FindClient_r(*tconn);
  318   thost = tclient->host;
  319   if (tclient->prfail == 1) { /* couldn't get the CPS */
...

If this is the case... well... there's no for-sure way around it right
now, though some people, IIRC, have been working on code changes to
avoid it. Some steps you can take to mitigate the problem involve making
sure all your clients respond promptly on their AFS callback ports
(7001/udp). With all of the packet manglers out on the network
(host-based firewalls, overanxious network administrators, etc.) you may
find things in the way of the AFS fileservers contacting their clients
on the callback port. One of the things that can cause this type of
lockup is requests to these clients timing out / taking a long time...
If things have been working fine for a while and now they don't, network
topology/firewall changes like this could be the culprit.


I've attached a script that I periodically run to see how many bad
clients are using my fileservers, so that I may try to track them down
and swat at them...


-

#!/usr/local/bin/perl

$| = 1;

# Collect the client IPs currently holding connections to a fileserver.
sub getclients {
    my $server = shift @_;
    my %ips;

    print STDERR "getting connections for $server\n";

    open(RXDEBUG, "/usr/afsws/etc/rxdebug -allconnections $server|")
        || die "cannot exec rxdebug\n";

    while (<RXDEBUG>) {
        if ( /Connection from host ([^, ]+)/ ) {
            my $ip = $1;
            if ( ! defined($ips{$ip}) ) {
                $ips{$ip} = $ip;
            }
        }
    }

    close RXDEBUG;
    return keys(%ips);
}

# Probe a client's callback service; returns 0 if it does not respond.
sub checkcmdebug {
    my $client = shift @_;

    print STDERR "checking $client\n";

    open(CMDEBUG, "/usr/afsws/bin/cmdebug -cache $client 2>&1|")
        || die "cannot exec cmdebug\n";

    while (<CMDEBUG>) {
        if ( /server or network not responding/ ) {
            close CMDEBUG;
            return 0;
        }
    }
    close CMDEBUG;
    return 1;
}

my %clients;

# modify this to run getclients on all of your AFS servers...
foreach my $y ( "ifs1", "ifs2", "hfs1", "hfs2", "bfs1", "hfs11", "hfs12" ) {
    foreach my $x ( getclients("$y.afs.umbc.edu") ) {
        $clients{$x}++;
    }
}

use Socket;

# Report clients whose callback port did not respond.
foreach my $x ( keys(%clients) ) {
    if ( ! checkcmdebug($x) ) {
        my $iaddr = inet_aton($x);
        my $name = gethostbyaddr($iaddr, AF_INET) || "unknown";
        print "$x ($name)\n";
    }
}

Re: [OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown

2006-10-06 Thread Lars Schimmer

Robert Banz wrote:

 First, upgrade your fileserver to an actual production release, such
 as 1.4.1. 1.3.81 was pretty good, but not without problems. (1.4.1 is
 not without problems, but it has fewer.)

 We are considering that as a (last) possibility, but we are running
 tens of Linux (Debian/stable) servers (not only AFS) as part of our
 distributed computing environment, and we try to keep our server
 configuration as close as possible to the stable distribution. In
 short: we have had no significant AFS problems with the same
 configuration for over a year...
 
 Keeping with a random Linux distro's idea of "stable" for your AFS
 code is not a good idea. Stick with OpenAFS's idea of stable -- and
 while for short periods I've run development (e.g. late 1.3.*) code on
 my production AFS servers when I was in a pinch, stick to the
 production releases. Ignore what Debian thinks, because they don't
 know what they're talking about ;)

The problem is that OpenAFS isn't synced with the distros' development.
OpenAFS has been getting better and better at a very quick pace (a BIG
thanks to the whole team behind it), so it hasn't been easy for the
stable distros to pick the right version to ship. In sarge in particular
it's a problem, because sarge has been stable for some time...
The next Debian will carry a 1.4.x version, and I think the 1.4.1+
versions are much more stable than 1.3.8x and therefore better suited to
stay in stable for a while.
At least there ARE some 1.4.x builds of OpenAFS for Debian sarge that
are easy to install.
In any case, we use 1.4.1 and 1.4.fc2/4 in production here on Debian
sarge and it works well enough for our small cell.
Most of our problems are with the Windows version, which is not as
mature as the Linux version (in my opinion) -- which doesn't mean it
isn't as stable as the Linux version.

MfG,
Lars Schimmer
- --
- -
TU Graz, Institut für ComputerGraphik & WissensVisualisierung
Tel: +43 316 873-5405   E-Mail: [EMAIL PROTECTED]
Fax: +43 316 873-5402   PGP-Key-ID: 0x4A9B1723