Thanks for the log files. I don't see any DB errors like Becky and
Randy were seeing at their site, so this may be a different problem.
One thing that jumps out in the server log file is that there is
occasionally a getattr state machine that starts, but never gets past
the PVFS request scheduler. In this case, the client will eventually
get a timeout like the one you are seeing in your client-side logs.
An example is on line 1433 in your pvfs2-server.log file:
[D 09/16 14:08] (0x2aaab055eff0) getattr (prelude sm) state: req_sched
That (0x2aaab055eff0) number should show up later in the log file as
that getattr state machine proceeds, but it never does. It doesn't look
like it even got to the first step of retrieving metadata.
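If it helps, here is a rough sketch of how one could scan the server log
for other state machines that stall the same way. It assumes every state
transition line follows the format of the single example above; the
function name find_stuck and the regular expression are my own and are
not part of PVFS. Note also that the hex identifier may be reused by
later requests, so a hit is only a hint that is worth checking by hand.

```python
import re

# Assumed line format, modelled on the one example line quoted above:
# [D 09/16 14:08] (0x2aaab055eff0) getattr (prelude sm) state: req_sched
SM_LINE = re.compile(
    r"\[D [^\]]*\] \((0x[0-9a-fA-F]+)\) (\S+) \(([^)]*)\) state: (\S+)"
)


def find_stuck(lines, stuck_state="req_sched"):
    """Return {sm_id: (line_no, op, state)} for every state machine
    whose last logged state is `stuck_state`.

    `lines` is any iterable of log lines, for example an open file.
    Line numbers are 1-based.
    """
    last_seen = {}
    for line_no, line in enumerate(lines, start=1):
        match = SM_LINE.search(line)
        if match:
            sm_id, op, _sm_name, state = match.groups()
            # Keep only the most recent transition for each identifier.
            last_seen[sm_id] = (line_no, op, state)
    return {
        sm_id: info
        for sm_id, info in last_seen.items()
        if info[2] == stuck_state
    }
```

Calling find_stuck on an open handle to pvfs2-server.log and printing the
result lists each identifier with the line number where it was last
seen. Requests that were still in flight when the log was captured will
show up as well, so compare the timestamps before drawing conclusions.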
We need to figure out why that request is getting stuck. What
"EventLogging" setting are you using on the server side to collect this
log? Could you add "access,access_detail" to the list and try to
capture a failure case again?
You can also use pvfs2-set-debugmask to set this at runtime without
restarting the server if that helps any.
thanks,
-Phil
Kumar, Amit H. wrote:
Hi Everyone!
Thank you for responding. I enabled PVFS2_DEBUGMASK to see what was going on,
before I responded.
Attached are the /tmp/pvfs2-client.log, /tmp/pvfs2-server.log, and
/var/log/pvfs2.debug.log (PVFS2_DEBUGFILE) files.
I do have the latest version, but not from CVS. And I don't have a backup,
so we will potentially lose some data :-(
I ran pvfs2-fsck on the metadata server, which has a local mount of the
PVFS2 file system, and got the following errors and warnings. Surprisingly,
I have no problem browsing the file system until I hit the files with
corrupt attributes.
# /opt/pvfs2/bin/pvfs2-fsck -vp -m /scratch/pvfs2
# Current FSID is 1415574627.
[E 12:09:56.029492] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 18.
[E 12:09:56.030046] Warning: msgpair failed to ib://pvfs2-io-0-0.local-ib:3335, will retry: Connection timed out
[E 12:09:56.030118] *** msgpairarray_completion_fn: msgpair to server ib://pvfs2-io-0-0.local-ib:3335 failed: Connection timed out
[E 12:09:56.030164] *** Non-BMI failure.
PVFS_mgmt_setparam_list: Connection timed out (error class: 0)
I believe the timeout is a result of not being able to access the files with
corrupt attributes.
Our metadata server and one of the I/O servers:
[r...@pvfs2-io-0-0 tmp]# ps -ef | grep pvfs
root 19964 1 0 Sep14 ? 00:24:02 /opt/pvfs2/sbin/pvfs2-server --pidfile /var/run/pvfs2.pid -a pvfs2-io-0-0.local-ib /opt/pvfs2/etc/pvfs2-fs.conf
root 20434 1 0 Sep14 ? 00:00:00 /opt/pvfs2/sbin/pvfs2-client -p /opt/pvfs2/sbin/pvfs2-client-core
root 20435 20434 0 Sep14 ? 00:00:05 pvfs2-client-core --child -a 5 -n 5 --logtype file -L /tmp/pvfs2-client.log
root 23623 19719 0 15:28 pts/2 00:00:00 grep pvfs
Any thoughts would be greatly appreciated!
-Amit
-----Original Message-----
From: Phil Carns [mailto:[email protected]]
Sent: Wednesday, September 16, 2009 2:55 PM
To: [email protected]
Cc: Kumar, Amit H.; [email protected]
Subject: Re: [Pvfs2-developers] PVFS2: files with ?---?--? permissions
Becky Ligon wrote:
Amit:
This means that the PVFS system cannot access the attributes database
containing the information about the particular file. It also means that
the file is unusable. You need to determine which metadata server is
having problems. If you don't have a backup, then you may not be able to
recover your file.
Here at Clemson, we just went through this painful process when one of
the Berkeley DB databases holding metadata became corrupt. You might try
db_recover, but it didn't help us. You might also try pvfs2-fsck. If
your file is unrecoverable, then pvfs2-fsck simply cleans up the
orphan objects.
Becky
Hi Amit,
Do you have anything in your server logs or the /tmp/pvfs2-client.log
file on the client side?
thanks,
-Phil