I was able to replicate it in a somewhat simpler environment this
afternoon. It looks like the problem is with the statfs and/or mount
upcalls.
The problem with those two is that they are serviced in
pvfs2-client-core using blocking functions, so if one of them hangs on a
long network timeout then no other operations (even to other file
systems) can be processed.
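
To make the effect concrete, here is a toy model of a single service loop. It is only a sketch: the names Upcall, serve_blocking and serve_nonblocking are hypothetical and are not pvfs2-client-core functions, and the 300 second wait is simply the ClientJobBMITimeoutSecs value quoted later in this thread, used here as an illustrative number.

```python
from dataclasses import dataclass


@dataclass
class Upcall:
    name: str          # label for the request, e.g. "statfs /mnt/pvfs2"
    service_secs: int  # simulated seconds until the request completes or gives up


def serve_blocking(queue):
    """Service upcalls one at a time; each one blocks the loop until it is done."""
    clock = 0
    finished = {}
    for op in queue:
        clock += op.service_secs  # nothing else is serviced during this wait
        finished[op.name] = clock
    return finished


def serve_nonblocking(queue):
    """Post every upcall at time 0 and let each one complete independently."""
    return {op.name: op.service_secs for op in queue}


queue = [
    Upcall("statfs /mnt/pvfs2", 300),    # server unreachable (illustrative wait)
    Upcall("statfs /mnt/pvfs2-tmp", 1),  # healthy file system
]

blocking = serve_blocking(queue)
nonblocking = serve_nonblocking(queue)
print("blocking:   ", blocking)
print("nonblocking:", nonblocking)
```

In the blocking model the request to the healthy file system cannot finish until the stuck one has given up, which matches the "df" on /mnt/pvfs2-tmp timing out even though its servers are fine.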
The kernel module has an operation timeout value that is independent of
the BMI timeout that pvfs2-client uses; therefore even once the statfs
command has timed out the pvfs2-client daemon is probably still hung for
a relatively long time (ClientJobBMITimeoutSecs * ClientRetryLimit seconds).
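
As a rough worked example of that estimate, using the settings David quotes later in this thread (the variable names are only illustrative, and this assumes the product above is the whole story, ignoring ClientRetryDelayMilliSecs):

```python
# Values quoted from the configuration and sysctl command later in this thread.
client_job_bmi_timeout_secs = 300  # ClientJobBMITimeoutSecs
client_retry_limit = 5             # ClientRetryLimit
kernel_op_timeout_secs = 5         # pvfs2.op-timeout-secs

# Estimate from the message: ClientJobBMITimeoutSecs * ClientRetryLimit.
daemon_hang_secs = client_job_bmi_timeout_secs * client_retry_limit

# Time during which the kernel module has already given up on the operation
# but the client daemon may still be stuck in the blocking call.
stale_window_secs = daemon_hang_secs - kernel_op_timeout_secs

print(daemon_hang_secs, "seconds, or", daemon_hang_secs / 60, "minutes")
print(stale_window_secs, "seconds after the kernel timeout")
```

With those settings the daemon could stay hung for roughly 25 minutes, while the kernel module reports a timeout to "df" after only 5 seconds.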
I will look into whether it is possible to make nonblocking versions
of these service functions...
-Phil
Sam Lang wrote:
On Feb 23, 2006, at 9:35 AM, David Metheny wrote:
This seems to happen on a 2.6 kernel also. I'm using 2.6.9-22 on a RHEL4
client.
I also attempted this with the network going away on a pvfs2 server node. I
issued "ifdown eth0 && sleep 200 && ifup eth0" on a pvfs2 server node
belonging to the /mnt/pvfs2 file system. I went through the same process of
issuing a "df" on /mnt/pvfs2, getting a connection timed out, then a "df" on
/mnt/pvfs2-tmp, and got a connection timed out also. I watched (ping) the
pvfs2 server node where eth0 was brought down, and immediately after eth0
came back up, I issued a "df" on /mnt/pvfs2-tmp again. It worked at this
point.
Hi David,
I get slightly different behavior. If I create a network partition
between client and server2 nodes and then do a df -h <mnt1>, I get an
operation timed out error on the first attempt, but repeated attempts
are successful. Also, when I do df -h <mnt2> my error is a little
different. Instead of connection timed out, I get an Invalid argument
(EINVAL). Not sure what's up with that. I'll keep looking into the
initial connection timed-out behavior; I just wanted to give you an update.
-sam
-----Original Message-----
From: David Metheny [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 23, 2006 8:27 AM
To: 'Sam Lang'
Cc: '[email protected]'
Subject: RE: [Pvfs2-developers] Problem with multiple pvfs2
file systems mounted on a single client
I wasn't able to reproduce the problem by just killing the
server process. I tried both killing the server process and
powering off the server, and the client handled the errors from
killing the server process fine.
I was using a 2.4.21-27 kernel on a RHEL3 client... I'll see
if I can reproduce on a 2.6 kernel.
-----Original Message-----
From: Sam Lang [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 22, 2006 4:48 PM
To: [EMAIL PROTECTED]
Cc: [email protected]
Subject: Re: [Pvfs2-developers] Problem with multiple pvfs2 file
systems mounted on a single client
Hi David,
I tried to reproduce your results with the 2.6 kernel, and wasn't able
to. Are you using 2.4? Also, I didn't actually pull the plug on one
of the nodes, I just killed the server, but that should be close
enough to your test case unless you're routing stuff through that node
;-).
-sam
On Feb 22, 2006, at 12:16 PM, David Metheny wrote:
It appears the error described below will span across other mounted
file systems on a client when encountered, until the client software
is reloaded.
I've got a client with 2 pvfs2 file systems mounted:
/mnt/pvfs2
/mnt/pvfs2-tmp
Both PVFS2 file system configurations contained the following when
mounted:
ServerJobBMITimeoutSecs 30
ServerJobFlowTimeoutSecs 30
ClientJobBMITimeoutSecs 300
ClientJobFlowTimeoutSecs 300
ClientRetryLimit 5
ClientRetryDelayMilliSecs 2000
I've dynamically changed the client's timeout setting after the
mounts:
[EMAIL PROTECTED] root]# /sbin/sysctl -w pvfs2.op-timeout-secs=5
A pvfs2 server node in the /mnt/pvfs2 file system lost power. After
issuing a "df -h /mnt/pvfs2", the client received a "connection timed out"
error.
[EMAIL PROTECTED] root]# df -h /mnt/pvfs2
Filesystem Size Used Avail Use% Mounted on
df: `/mnt/pvfs2': Connection timed out
An immediate subsequent "df -h /mnt/pvfs2-tmp" also returned
"connection timed out"
[EMAIL PROTECTED] root]# df -h /mnt/pvfs2-tmp
df: `/mnt/pvfs2-tmp': Connection timed out
An unmount of the /mnt/pvfs2 share works fine.
[EMAIL PROTECTED] root]# umount /mnt/pvfs2
Another subsequent "df -h /mnt/pvfs2-tmp" still returns
"connection timed out"
[EMAIL PROTECTED] root]# df -h /mnt/pvfs2-tmp
df: `/mnt/pvfs2-tmp': Connection timed out
After unloading the userspace and kernel module, restarting pvfs2
software, and remounting the /mnt/pvfs2-tmp filesystem, a
"df -h /mnt/pvfs2-tmp" successfully completed
[EMAIL PROTECTED] root]# df -h /mnt/pvfs2-tmp
Filesystem Size Used Avail Use% Mounted on
hostname:3334/pvfs2-fs
1.9T 381G 1.6T 20% /mnt/pvfs2-tmp
The pvfs2 client log contained:
[E 02/22 11:28] msgpair failed, will retry:: Connection refused
[E 02/22 11:28] msgpair failed, will retry:: Connection refused
[E 02/22 11:28] msgpair failed, will retry:: Connection refused
[E 02/22 11:29] msgpair failed, will retry:: Connection refused
[E 02/22 11:29] msgpair failed, will retry:: Connection refused
[E 02/22 11:29] msgpair failed, will retry:: Connection refused
[E 02/22 11:29] *** msgpairarray_completion_fn: msgpair to server tcp://hvcwydev0329:3334 failed: Connection refused
[E 02/22 11:29] *** Out of retries.
[E 02/22 11:29] Statfs failed: Connection refused
[E 02/22 11:36] msgpair failed, will retry:: Operation cancelled (possibly due to timeout)
[E 02/22 11:39] msgpair failed, will retry:: Connection timed out
[E 02/22 11:42] msgpair failed, will retry:: Connection timed out
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers