Hello,
I am continuing my pvfs2 tests and have found another problem. I
will describe the new configuration, since it has changed since
my last mail:
- 1 host with a ~30 GB ext3 slice mounted from a SAN via a Qlogic
FC adapter, acting as metadata server and client;
- 5 hosts with a ~400 GB ext3 slice each, mounted as above, acting
as I/O servers and clients;
- 24 hosts acting as clients only.
- Debian 4.0, kernel 2.6.24, pvfs2 module 2.7.1.
The test I am running executes the following on every machine in
the cluster (29 in total) except the metadata server:
iozone -Cce -s8g -r256k -i0 -i1 -t4 -F /mnt/test/iozone{1,2,3,4}
which should write 32 GB of data per machine in 4 files of 8 GB
each, then rewrite and read those data.
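For context, here is a quick back-of-the-envelope check of the aggregate load (just a sketch in Python; the machine count and sizes are the ones from my description above):

```python
# Each iozone run: -t4 threads, -s8g file size per thread.
threads_per_machine = 4
file_size_gb = 8
machines = 29  # all cluster nodes except the metadata server

per_machine_gb = threads_per_machine * file_size_gb
total_gb = machines * per_machine_gb
print(per_machine_gb)  # 32 GB written per machine
print(total_gb)        # 928 GB in total, spread over the 5 I/O servers
```

So during the first (write) phase the 5 I/O servers together absorb almost a terabyte of writes.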
While the machines are still in the first test (writing), the
server logs start filling with the following:
----- cut here -----
[E 10/15 15:51] handle_io_error: flow proto error cleanup started
on 0x2aaab40252c0: Operation cancelled (possibly due to timeout)
[E 10/15 15:51] handle_io_error: flow proto 0x2aaab40252c0
canceled 3 operations, will clean up.
[E 10/15 15:51] handle_io_error: flow proto 0x2aaab40252c0
error cleanup finished: Operation cancelled (possibly due to timeout)
----- cut here -----
Only once did I also find the following:
----- cut here -----
[E 10/15 15:54] src/server/final-response.sm line 127: Error: PINT_encode()
failure.
[E 10/15 15:54] [bt] pvfs2-server [0x4507fb]
[E 10/15 15:54] [bt] pvfs2-server(PINT_state_machine_invoke+0xe8)
[0x440d28]
[E 10/15 15:54] [bt] pvfs2-server(PINT_state_machine_next+0xc9)
[0x441049]
[E 10/15 15:54] [bt] pvfs2-server(PINT_state_machine_continue+0x1e)
[0x440b9e]
[E 10/15 15:54] [bt] pvfs2-server(main+0xe3e) [0x41215e]
[E 10/15 15:54] [bt] /lib/libc.so.6(__libc_start_main+0xe6)
[0x2b3b3d1811a6]
[E 10/15 15:54] [bt] pvfs2-server [0x40f7d9]
[E 10/15 15:54] Server Response 0x2aaaac032690 is of type:
PVFS_SERV_SMALL_IO
[E 10/15 15:54] FIXME: unimplemented resp type to print
----- cut here -----
I can see the above errors only on three of the five I/O servers:
precisely those servers which are using a 'logical volume' from the same
'virtual disk' in the SAN. The other two I/O servers use a
different virtual disk and show no errors.
Is it possible that the error reported by pvfs2 is actually a SAN/FC
error, meaning the SAN is overloaded? That would explain
why only three servers are having problems...
Is there any chance these errors can harm the data being written?
If increasing a timeout is the solution, which of the following
parameters should I modify: ServerJobBMITimeoutSecs,
ServerJobFlowTimeoutSecs, ClientJobBMITimeoutSecs,
ClientJobFlowTimeoutSecs, ClientRetryLimit,
ClientRetryDelayMilliSecs?
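For what it's worth, I assume these would all go in the server config file, something like the fragment below; the section name and values are only my guess from the sample config, not something I have verified:

```
<Defaults>
    # Example values only; I have not checked the built-in defaults.
    ServerJobBMITimeoutSecs 300
    ServerJobFlowTimeoutSecs 300
    ClientJobBMITimeoutSecs 300
    ClientJobFlowTimeoutSecs 300
    ClientRetryLimit 5
    ClientRetryDelayMilliSecs 2000
</Defaults>
```

Please correct me if some of these belong elsewhere or are client-side only.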
Thank you very much,
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers