[Pvfs2-users] PVFS2 v.1.5.1 'Job time out' on some pvfs2-cp and pvfs2-rm

Mark Van De Vyver Thu, 15 Feb 2007 17:49:26 -0800

Hi,
Thank you for all the effort put into making PVFS2 available.
I'm relatively new to Linux (coming from WinXP) and have built a 3-node
cluster using the Rocks Cluster software v4.2.1.  I've installed the
PVFS2 roll and by following the PVFS2 roll guide all has proceeded
very smoothly - really, thanks - I'd expected a few days/weeks to get
to this point.


At the end of this email I pose some questions that the following
behavior has raised.

About my set-up:
A single user.  I made no changes to the PVFS configuration
established by the PVFS2 roll, and have one head node and two
compute-I/O nodes.
PVFS version 1.5.1

The unexpected behavior:
Using pvfs2-cp I have copied approx. 900GB of files from several DVDs
(I dd each disc to a tmpfs area, then pvfs2-cp this 'image' to
/mnt/pvfs2/some/path).
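
For concreteness, the copy step looks roughly like the sketch below.  The
device node, the tmpfs mount point and the image name are placeholders
(assumptions for illustration), not the exact commands I typed; only dd,
pvfs2-cp and the /mnt/pvfs2/some/path destination come from my set-up.

```shell
#!/bin/sh
# Sketch of the staging workflow: read a disc into tmpfs, then push the
# image into PVFS2. Every default below is a placeholder (assumption).

SRC_DEV="${SRC_DEV:-/dev/dvd}"                 # assumed DVD device node
STAGE_DIR="${STAGE_DIR:-/dev/shm}"             # assumed tmpfs mount point
DEST_DIR="${DEST_DIR:-/mnt/pvfs2/some/path}"   # destination on the PVFS2 volume
COPY_CMD="${COPY_CMD:-pvfs2-cp}"               # set to cp to compare behaviour

stage_and_copy() {
    name="$1"                                  # hypothetical image file name
    image="$STAGE_DIR/$name"

    # Read the source into the tmpfs staging area in 1 MiB blocks.
    dd if="$SRC_DEV" of="$image" bs=1048576 || return 1

    # Copy the staged image to the PVFS2 volume.
    "$COPY_CMD" "$image" "$DEST_DIR/$name" || return 1

    # Free the tmpfs space only after the copy has succeeded.
    rm -f "$image"
}
```

Calling stage_and_copy disc01.iso would stage and copy one disc; setting
COPY_CMD=cp runs the same steps through the mounted file system instead
of pvfs2-cp.
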
I have noticed that this runs fine so long as it is the first time the
file is copied.  If I use pvfs2-rm to delete a file, not necessarily
from the same node used to make the copy, the following occurs (all
nodes seem to be up and working fine):
- I can see the file is removed using the GNOME file browser.
- The pvfs2-rm seems to hang, and the following message is displayed:

[E 15:10:02.584608] Job time out: cancelling bmi operation, job_id: 21.
[E 15:10:02.584769] msgpair failed, will retry: Operation cancelled
(possibly due to timeout)

If I try to re-copy the file (using pvfs2-cp), again not necessarily
from the same node it was first copied on, then I see the following
and the copy fails:

[E 15:26:53.690560] Job time out: cancelling bmi operation, job_id: 25.
[E 15:26:53.690710] msgpair failed, will retry: Operation cancelled
(possibly due to timeout)
[E 15:26:53.690733] *** msgpairarray_completion_fn: msgpair to server
tcp://pvfs2-compute-0-1:3334 failed: Operation cancelled (possibly due
to timeout)
[E 15:26:53.690743] *** No retries requested.
pvfs2-cp: src/client/sysint/sys-getattr.sm:331: getattr_acache_lookup:
Assertion `object_ref.handle != ((PVFS_handle)0)' failed.

On rebooting one of the nodes I was forced to run fsck; after this the
cluster seems to have returned to 'normal'.

The good news is that the standard Linux commands cp and rm don't seem
to have any trouble, so I am using those at the moment.  I couldn't
find any advice on whether cp, etc., is preferred to pvfs2-cp, or vice
versa.

1) Is this a known issue that is fixed in PVFS 2.6?
2) Is it fine to continue to use v1.5.1 so long as I don't use the
pvfs2-* commands?
3) Is upgrading to v2.6 on a Rocks cluster straightforward, or is
it likely to involve some 'debugging' and a few days' work?  Bear in
mind my relative inexperience with Linux.

Regards
Mark
