[EMAIL PROTECTED] wrote on Thu, 11 Oct 2007 13:45 -0500:
> I am testing an IA64 PVFS client using IB. The client seems to behave
> and work well most of the time. For some reason under certain uses the
> pvfs IA64 client hangs. Can't quite reproduce the hang but it happens
> often enough to be annoying.
This advice won't be too satisfying then. Reproducibility is key to
figuring out problems. First of all, I suggest you run the latest
PVFS version.
And if you can do some simple testing that can cause the problem,
that will help immensely. Bonus points for using "pvfs2-cp" or MPI
programs directly, rather than going through the kernel interface.
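One way to hunt for a reproducer is to run the same small operation in a loop with a time limit and note which attempt stalls. The following is only a rough sketch: the helper name run_until_hang, the attempt count, the time limit, and the file paths in the usage comment are all made up for illustration. A process stuck in an uninterruptible kernel wait may not respond to the kill that follows the timeout, so this is better suited to "pvfs2-cp" than to commands that go through the kernel interface.

```python
import subprocess
import sys


def run_until_hang(cmd, attempts, timeout_s):
    """Run cmd up to `attempts` times.

    Returns the 1-based number of the first attempt that timed out or
    exited with a non-zero status, or None if every attempt completed.
    """
    for i in range(1, attempts + 1):
        try:
            # check=True turns a non-zero exit status into an exception;
            # timeout= kills the child if it does not finish in time.
            subprocess.run(cmd, check=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print(f"attempt {i}: no completion within {timeout_s}s",
                  file=sys.stderr)
            return i
        except subprocess.CalledProcessError as exc:
            print(f"attempt {i}: exit status {exc.returncode}",
                  file=sys.stderr)
            return i
    return None


# Hypothetical usage (paths and limits are assumptions, adjust to taste):
#   run_until_hang(["pvfs2-cp", "/tmp/testfile", "/pvfs/testfile"],
#                  attempts=100, timeout_s=120)
```

If a loop like this stalls at a repeatable point, or only under a certain file size, that is already much more to go on than "it hangs sometimes".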
> On the client and on the Metadata server I find messages like this:
>
> IA64 client
> ----------------
> [E 11:49:06.762417] job_time_mgr_expire: job time out: cancelling bmi
> operation, job_id: 2425168.
> [E 11:49:06.762762] msgpair failed, will retry: Connection timed out
> [E 11:49:06.762796] *** msgpairarray_completion_fn: msgpair to server
> ib://hpcxe001:3337,tcp://hpcxe001:3336 failed: Connection timed out
> [E 11:49:06.762810] *** Non-BMI failure.
> [E 11:49:06.762823] getattr_object_getattr_failure : Connection timed
> out
> [E 11:54:07.121232] job_time_mgr_expire: job time out: cancelling bmi
> operation, job_id: 2425476.
> [E 11:54:07.121277] job_time_mgr_expire: job time out: cancelling bmi
> operation, job_id: 2425478.
>
> pvfs metadata server
> ---------------------
> hpcxe001: [E 10/11 11:44] job_time_mgr_expire: job time out: cancelling
> bmi operation, job_id: 4432802.
>
>
> pvfs i/o server
> ----------------
> hpcxe005: [E 10/11 11:28] job_time_mgr_expire: job time out: cancelling
> bmi operation, job_id: 5946725.
>
> Anyone know what this means? Anyway to get pvfs-client started in a
> more verbose or debug mode so it can log more info for me to look at?
They all just gave up on each other. The client asked the metadata
server to do something, but never got a response. The servers look
like they were trying to send responses that were never acknowledged,
so they gave up. This sort of pattern happens if the network
disappears. (You might want to synchronize clocks on 001 and 005 to
make sure these events are correlated.)
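Once the clocks agree, lining up the timeout events from each log makes it easier to see whether they cluster. Below is a minimal sketch that assumes only the two timestamp layouts visible in the quoted logs (time of day with microseconds on the client, "MM/DD HH:MM" with a host prefix on the servers) and assumes each log entry sits on one line, unlike the wrapped quotes above. The function name timeout_events is hypothetical, and the sample lines are copied from the quoted output.

```python
import re
from datetime import timedelta

# Client layout:  [E 11:49:06.762417] job_time_mgr_expire: ...
CLIENT_RE = re.compile(
    r"\[E (\d{2}):(\d{2}):(\d{2})\.\d+\] job_time_mgr_expire")
# Server layout:  hpcxe001: [E 10/11 11:44] job_time_mgr_expire: ...
SERVER_RE = re.compile(
    r"^(\S+): \[E \d{2}/\d{2} (\d{2}):(\d{2})\] job_time_mgr_expire")


def timeout_events(source, lines):
    """Return (time_of_day, source) for every job timeout line.

    Server lines carry their own host name, which replaces `source`.
    """
    events = []
    for line in lines:
        m = CLIENT_RE.search(line)
        if m:
            h, mi, s = map(int, m.groups())
            events.append((timedelta(hours=h, minutes=mi, seconds=s), source))
            continue
        m = SERVER_RE.search(line)
        if m:
            host, h, mi = m.group(1), int(m.group(2)), int(m.group(3))
            events.append((timedelta(hours=h, minutes=mi), host))
    return events


client_log = [
    "[E 11:49:06.762417] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 2425168.",
    "[E 11:49:06.762762] msgpair failed, will retry: Connection timed out",
    "[E 11:54:07.121232] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 2425476.",
]
server_log = [
    "hpcxe001: [E 10/11 11:44] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 4432802.",
    "hpcxe005: [E 10/11 11:28] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 5946725.",
]

# Merge both sources and list the timeouts in time-of-day order.
events = sorted(timeout_events("client", client_log)
                + timeout_events("server", server_log))
for when, src in events:
    print(when, src)
```

The server stamps only have minute resolution, so this shows rough ordering at best, and it is only meaningful once the clocks are actually in sync.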
You can start pvfs2-client with an extra command line argument:

    --gossip-mask=client,network

That will put lots of logging into /tmp/pvfs2-client.log though.

You can also tell the servers to change their loglevel dynamically:

    pvfs2-set-debugmask -m /pvfs server,network

And there is an EventMask in the fs.conf file that can do the same
thing if you want this to be permanent across server restarts.
Good luck.
-- Pete