Hi Brian,
The second error message that you reported (final-response.sm line 127)
comes from a minor bug that was triggered by the timeouts. That has been fixed
in CVS, and you can find a description and patch at the following links
if you want to try it:
http://www.pvfs.org/fisheye/changelog/PVFS/?cs=MAIN:pcarns:20081008183827
http://www.pvfs.org/fisheye/rdiff/PVFS?csid=MAIN:pcarns:20081008183827&u&N
Your main error messages do seem to indicate that something on your
system isn't keeping up, though. It is hard to tell if it is the
network or the disk that stalled, but your information about the SAN
does seem to implicate the disk.
I have two configuration suggestions that you can try. First, modify
the <StorageHints> section in the server configuration file to include
this line:
TroveMethod alt-aio
That will switch the disk I/O method in PVFS to a faster mechanism. It
worked fine in the 2.7.1 release, but it was not yet the default at that time.
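For reference, the section might end up looking something like this. This is a
minimal sketch: only the TroveMethod line is the actual change, and the other
two lines are illustrative defaults that should match whatever your existing
config already has.

```
<StorageHints>
    TroveSyncMeta yes
    TroveSyncData no
    TroveMethod alt-aio
</StorageHints>
```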
Secondly, as far as timeouts are concerned, I would start by increasing
"ServerJobFlowTimeoutSecs" from 30 to maybe 300.
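That change goes in the server config file as well. Assuming the stock layout,
where the job timeouts live in the <Defaults> section, it would look roughly
like this (300 is just a starting point, not a tuned value):

```
<Defaults>
    ServerJobFlowTimeoutSecs 300
</Defaults>
```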
None of the things that you are seeing will harm your data, but your
system certainly won't perform very well.
One final configuration option that you can try is changing
"TroveSyncData no" to "TroveSyncData yes". I would suggest saving that
one for last after you have resolved your timeout problem, and then try
your benchmark with both settings. On some SANs you may get better
performance with syncing enabled, but the only way to find out is to
test it and see.
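One way to run that comparison is to flip the setting in the config, restart
the servers, and rerun the benchmark. A rough shell sketch of the edit step
follows; the config filename is a placeholder, and the restart and iozone
steps are only indicated in comments since they depend on your setup:

```shell
#!/bin/sh
# Work on a throwaway copy of the config; substitute your real fs.conf path.
CONF=fs.conf.example
printf '<StorageHints>\n    TroveSyncData no\n</StorageHints>\n' > "$CONF"

# Flip "no" to "yes" for the second benchmark pass.
sed 's/TroveSyncData no/TroveSyncData yes/' "$CONF" > "$CONF.new" \
    && mv "$CONF.new" "$CONF"
grep TroveSyncData "$CONF"

# Placeholder steps for a real run:
#   restart pvfs2-server on each I/O node, then rerun the iozone
#   benchmark and compare throughput between the two settings.
```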
Good luck, and let us know what you find.
-Phil
[EMAIL PROTECTED] wrote:
Hello,
I am continuing my pvfs2 tests and have found another problem. I
will describe the new configuration, since it has changed since
my last mail:
- 1 host with a ~30 GB ext3 slice mounted from a SAN via Qlogic
FC, acting as metadata server and client;
- 5 hosts with a ~400 GB ext3 slice each, mounted as above, acting
as I/O servers and clients;
- 24 hosts acting as clients only;
- Debian 4.0, kernel 2.6.24, pvfs2 module 2.7.1.
Well, the test I am doing executes the following on every
machine in the cluster (29 in total) except the metadata server:
iozone -Cce -s8g -r256k -i0 -i1 -t4 -F /mnt/test/iozone{1,2,3,4}
which should write 32 GB of data in 4 files of 8 GB each, then
rewrite and read that data.
While the machines are still in the first test (writing), the
server logs start filling with the following:
----- cut here -----
[E 10/15 15:51] handle_io_error: flow proto error cleanup started
on 0x2aaab40252c0: Operation cancelled (possibly due to timeout)
[E 10/15 15:51] handle_io_error: flow proto 0x2aaab40252c0
canceled 3 operations, will clean up.
[E 10/15 15:51] handle_io_error: flow proto 0x2aaab40252c0
error cleanup finished: Operation cancelled (possibly due to timeout)
----- cut here -----
Only once, I also found the following:
----- cut here -----
[E 10/15 15:54] src/server/final-response.sm line 127: Error: PINT_encode()
failure.
[E 10/15 15:54] [bt] pvfs2-server [0x4507fb]
[E 10/15 15:54] [bt] pvfs2-server(PINT_state_machine_invoke+0xe8)
[0x440d28]
[E 10/15 15:54] [bt] pvfs2-server(PINT_state_machine_next+0xc9)
[0x441049]
[E 10/15 15:54] [bt] pvfs2-server(PINT_state_machine_continue+0x1e)
[0x440b9e]
[E 10/15 15:54] [bt] pvfs2-server(main+0xe3e) [0x41215e]
[E 10/15 15:54] [bt] /lib/libc.so.6(__libc_start_main+0xe6)
[0x2b3b3d1811a6]
[E 10/15 15:54] [bt] pvfs2-server [0x40f7d9]
[E 10/15 15:54] Server Response 0x2aaaac032690 is of type:
PVFS_SERV_SMALL_IO
[E 10/15 15:54] FIXME: unimplemented resp type to print
----- cut here -----
I can see the above errors on only three of the five I/O servers:
precisely those servers which are using a 'logical volume' from the same
'virtual disk' in the SAN. The other two I/O servers use a
different virtual disk and show no errors.
Is it possible that the error reported by pvfs is actually a SAN/FC-related
error indicating that the SAN is too heavily loaded? That would explain
why only three servers are having problems...
Is there any chance these errors can harm the data being written?
If increasing a timeout is the solution, which of the following
parameters should I modify: ServerJobBMITimeoutSecs,
ServerJobFlowTimeoutSecs, ClientJobBMITimeoutSecs,
ClientJobFlowTimeoutSecs, ClientRetryLimit, or
ClientRetryDelayMilliSecs?
Thank you very much,
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers