Hi Ian,
The log doesn't include any errors, so I have to assume the server is
crashing before writing any to the log. Is the server compiled with
debug symbols? Is there a core dump on the node where the server
died? If so, can you send it to me? You might need to re-configure
and re-compile the source with debugging symbols enabled:
make clean
CFLAGS=-g ./configure --enable-strict ....
make
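If there's no core file at all, core dumps may simply be disabled on
that node. A rough sketch of what I'd try (the binary and core paths
are guesses for a default install; adjust them to yours):

# In the shell or init script that starts the server:
ulimit -c unlimited

# After the next crash, on the node that died:
gdb /usr/local/sbin/pvfs2-server core
(gdb) bt full

A full backtrace from the crash should tell us a lot more than the
verbose log will.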
Thanks,
-sam
On Nov 19, 2007, at 11:15 AM, Ian E. Morgan wrote:
I have been investigating pvfs2 for use on a small 10-node cluster,
and have been hitting random failures while simply copying data
into the filesystem.
10 servers, each sharing 400GB into a 3.6TiB filesystem, mounted on all
10 nodes via the pvfs2 kernel module and pvfs2-client. During heavy
writing to the FS, one instance of pvfs2-server (at random) will
typically die after anywhere from 10 to 30 minutes.
Each node handles both data and metadata. Under an earlier config
where only one server handled metadata, it was that one metadata
node's server that crashed, so I suspect it's related to the metadata
handling as opposed to the data.
At the moment, my testing has simply been copying 2TiB of data into
the PVFS2 volume, by having all 10 nodes copy their local data into
the shared volume.
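Concretely, each node is doing little more than the following (paths
are simplified here; /mnt/rfd is the PVFS2 mountpoint):

mkdir -p /mnt/rfd/$(hostname)
cp -a /local/data/. /mnt/rfd/$(hostname)/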
These nodes have been pretty rock solid until I tried running a
clustered filesystem. Having had all sorts of trouble with GlusterFS, I
have been on a hunt for something more stable.
On the advice of "robl", I have enabled 'pvfs2-set-debugmask -m
/mnt/rfd verbose'.
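(For completeness, that's the exact command as run against each node's
mountpoint; my understanding is that 'none' drops the mask back to
minimal logging afterwards, though I haven't verified that:)

pvfs2-set-debugmask -m /mnt/rfd verbose   # crank up server-side logging
pvfs2-set-debugmask -m /mnt/rfd none      # should reset it again later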
Once a node failed again:
<robl> iemorgan: the last few lines should be enough :>
<iemorgan> the end of the server log for the node that failed is:
<iemorgan> [D 11/19 11:31] *** starting delayed ops if any (state is LIST_PROC_ALLPOSTED)
[D 11/19 11:31] lebf_encode_rel
[D 11/19 11:31] op_queue add: 0xb4218780
[D 11/19 11:31] [BMI CONTROL]: BMI_set_info: set_info: 135678016 option: 6
[D 11/19 11:31] [BMI CONTROL]: BMI_set_info: searching for ref 135678016
[D 11/19 11:31] flowproto-multiqueue trove_write_callback_fn, error_code: 0, flow: 0x8158420.
[D 11/19 11:31] [BMI CONTROL]: BMI_set_info: decremented ref 135678016 to: 0
[D 11/19 11:31] DBPF I/O ops in progress: 0
[D 11/19 11:31] flowproto completing 0x8158420
<robl> iemorgan: huh. ok, this all looks cryptic-but-normal to me.
might be time to bring in the big guns
([EMAIL PROTECTED] mailing list)
So I have attached a good-sized chunk of the tail of the server log
from the failed node. The log continues right up until the server
process died.
I hope someone can help narrow down the problem, so that we can either
fine-tune the debugmask to a more specific area of interest or
identify and resolve the problem outright.
--
Ian Morgan
Software Developer
Teledyne Controls Simulation Ltd.
1-5480 Canotek Rd.
Ottawa, ON K1J 9H5
613-749-6980
x354

<b9.svr.log.gz>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users