I have been investigating pvfs2 for use on a small 10-node cluster, and have been having some random failures while simply copying data into the filesystem.
The setup is 10 servers, each contributing 400GB to a 3.6TiB
filesystem, mounted on all 10 nodes via the pvfs2 kernel module and
pvfs2-client. During heavy writing to the FS, one instance of
pvfs2-server (at random) will typically die after anywhere from 10 to
30 minutes.
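As a sanity check on the reported size, here is a rough sketch of the
arithmetic, assuming "400GB" means decimal gigabytes (10^9 bytes) per
server and that the filesystem size is reported in binary TiB. The
variable names are illustrative only.

```python
# Assumption: each server contributes 400 decimal gigabytes (10**9 bytes).
SERVERS = 10
BYTES_PER_SERVER = 400 * 10**9

# Total raw capacity across all servers, in bytes.
total_bytes = SERVERS * BYTES_PER_SERVER

# Convert to binary tebibytes (2**40 bytes), the unit of the reported size.
total_tib = total_bytes / 2**40

print(f"{total_tib:.2f} TiB")
```

Under those assumptions the total rounds to the 3.6TiB quoted above,
so the volume appears to include all 10 servers.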
Each node is handling both data and metadata. During an earlier config
where only one server handled metadata, it was that one metadata
node's server that crashed, so I suspect it's related to the metadata
handling as opposed to the data.
At the moment, my testing has simply been copying 2TiB of data into
the PVFS2 volume, by having all 10 nodes copy their local data into
the shared volume.
These nodes have been pretty rock solid until I tried running a
clustered filesystem. I had all sorts of trouble with GlusterFS, so I
have been on a hunt for something more stable.
On the advice of "robl", I have enabled
'pvfs2-set-debugmask -m /mnt/rfd verbose'.
Once a node failed again:
<robl> iemorgan: the last few lines should be enough :>
<iemorgan> the end of the server log for the node that failed is:
<iemorgan> [D 11/19 11:31] *** starting delayed ops if any (state is LIST_PROC_ALLPOSTED)
[D 11/19 11:31] lebf_encode_rel
[D 11/19 11:31] op_queue add: 0xb4218780
[D 11/19 11:31] [BMI CONTROL]: BMI_set_info: set_info: 135678016 option: 6
[D 11/19 11:31] [BMI CONTROL]: BMI_set_info: searching for ref 135678016
[D 11/19 11:31] flowproto-multiqueue trove_write_callback_fn, error_code: 0, flow: 0x8158420.
[D 11/19 11:31] [BMI CONTROL]: BMI_set_info: decremented ref 135678016 to: 0
[D 11/19 11:31] DBPF I/O ops in progress: 0
[D 11/19 11:31] flowproto completing 0x8158420
<robl> iemorgan: huh. ok, this all looks cryptic-but-normal to me.
might be time to bring in the big guns
([EMAIL PROTECTED] mailing list)
So I have attached a good-sized chunk of the tail of the server log
from the failed node. The log continues right up until the server
process died.
I hope someone can help narrow down the problem, then maybe we can
fine-tune the debugmask to a more specific area of interest or
identify/resolve the problem outright.
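To help decide which debug areas matter, here is a minimal sketch that
reads the tail of a gzip-compressed server log and counts which
sources produced the final messages. It assumes only the line format
shown in the excerpt above ("[D 11/19 11:31] message"); the function
names tail_gz and tally_sources are hypothetical helpers, not part of
PVFS2.

```python
import gzip
import re
from collections import Counter, deque

# Matches lines such as "[D 11/19 11:31] [BMI CONTROL]: BMI_set_info: ..."
# (format assumed from the log excerpt in this message).
LINE_RE = re.compile(r"^\[([A-Z]) (\d+/\d+ \d+:\d+)\] (.*)$")
TAG_RE = re.compile(r"^\[([^\]]+)\]")


def tail_gz(path, n=200):
    """Return the last n lines of a gzip-compressed text log."""
    with gzip.open(path, "rt", errors="replace") as fh:
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


def tally_sources(lines):
    """Count messages by bracketed tag, or by first word when untagged."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip wrapped continuations and unrelated lines
        msg = m.group(3)
        tag = TAG_RE.match(msg)
        if tag:
            key = tag.group(1)
        else:
            words = msg.split()
            key = words[0] if words else "(empty)"
        counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py b9.svr.log.gz
    for source, count in tally_sources(tail_gz(sys.argv[1])).most_common():
        print(f"{count:6d}  {source}")
```

Running it against the attached log would show whether the last
messages before the crash cluster in one area, which might suggest a
narrower debug mask than "verbose".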
--
Ian Morgan
Software Developer
Teledyne Controls Simulation Ltd.
1-5480 Canotek Rd.
Ottawa, ON K1J 9H5
613-749-6980 x354
b9.svr.log.gz
Description: GNU Zip compressed data
