Hi Brian,
The second error message that you reported (final-response.sm line 127)
comes from a minor bug that was triggered by the timeouts. That has been fixed
in CVS, and you can find a description and patch at the following links
if you want to try it:
http://www.pvfs.org/fisheye/changelog/PVFS/?cs=MAIN:pcarns:20081008183827
http://www.pvfs.org/fisheye/rdiff/PVFS?csid=MAIN:pcarns:20081008183827&u&N
Your main error messages do seem to indicate that something on your
system isn't keeping up, though. It is hard to tell if it is the
network or the disk that stalled, but your information about the SAN
does seem to implicate the disk.
I have two configuration suggestions that you can try. First, modify
the <StorageHints> section in the server configuration file to include
this line:
TroveMethod alt-aio
That will switch the disk I/O method in PVFS to a faster mechanism. It
worked fine in the 2.7.1 release, but it was not yet the default at that time.
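For reference, the section might end up looking something like this. This is a
minimal sketch: only the TroveMethod line is the actual change, and the other
two lines are illustrative defaults that should match whatever your existing
config already has.

```
<StorageHints>
    TroveSyncMeta yes
    TroveSyncData no
    TroveMethod alt-aio
</StorageHints>
```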
Secondly, as far as timeouts are concerned, I would start by increasing
"ServerJobFlowTimeoutSecs" from 30 to maybe 300.
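That change goes in the server config file as well. Assuming the stock layout,
where the job timeouts live in the <Defaults> section, it would look roughly
like this (300 is just a starting point, not a tuned value):

```
<Defaults>
    ServerJobFlowTimeoutSecs 300
</Defaults>
```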
None of the things that you are seeing will harm your data, but your
system certainly won't perform very well.
One final configuration option that you can try is changing
"TroveSyncData no" to "TroveSyncData yes". I would suggest saving that
one for last after you have resolved your timeout problem, and then try
your benchmark with both settings. On some SANs you may get better
performance with syncing enabled, but the only way to find out is to
test it and see.
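One way to run that comparison is to flip the setting in the config, restart
the servers, and rerun the benchmark. A rough shell sketch of the edit step
follows; the config filename is a placeholder, and the restart and iozone
steps are only indicated in comments since they depend on your setup:

```shell
#!/bin/sh
# Work on a throwaway copy of the config; substitute your real fs.conf path.
CONF=fs.conf.example
printf '<StorageHints>\n    TroveSyncData no\n</StorageHints>\n' > "$CONF"

# Flip "no" to "yes" for the second benchmark pass.
sed 's/TroveSyncData no/TroveSyncData yes/' "$CONF" > "$CONF.new" \
    && mv "$CONF.new" "$CONF"
grep TroveSyncData "$CONF"

# Placeholder steps for a real run:
#   restart pvfs2-server on each I/O node, then rerun the iozone
#   benchmark and compare throughput between the two settings.
```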
Good luck, and let us know what you find.
-Phil
[EMAIL PROTECTED] wrote:
Hello,
I am continuing my pvfs2 tests and have found another problem. I
will describe the new configuration, since it has changed since
my last mail:
- 1 host with a ~30 GB ext3 slice mounted from a SAN via Qlogic
FC, acting as metadata server and client;
- 5 hosts with a ~400 GB ext3 slice each, mounted as above, acting
as I/O servers and clients;
- 24 hosts acting as clients only;
- Debian 4.0, kernel 2.6.24, pvfs2 module 2.7.1.
Well, the test I am doing executes the following on every
machine in the cluster (29 in total) except the metadata server:
iozone -Cce -s8g -r256k -i0 -i1 -t4 -F /mnt/test/iozone{1,2,3,4}
which should write 32 GB of data in 4 files of 8 GB each, then
rewrite and read that data.
While the machines are still in the first test (writing), the
server logs start filling with the following:
----- cut here -----
[E 10/15 15:51] handle_io_error: flow proto error cleanup started
on 0x2aaab40252c0: Operation cancelled (possibly due to timeout)
[E 10/15 15:51] handle_io_error: flow proto 0x2aaab40252c0
canceled 3 operations, will clean up.
[E 10/15 15:51] handle_io_error: flow proto 0x2aaab40252c0
error cleanup finished: Operation cancelled (possibly due to timeout)
----- cut here -----
Only once, I also found the following:
----- cut here -----
[E 10/15 15:54] src/server/final-response.sm line 127: Error: PINT_encode()
failure.
[E 10/15 15:54] [bt] pvfs2-server [0x4507fb]
[E 10/15 15:54] [bt] pvfs2-server(PINT_state_machine_invoke+0xe8)
[0x440d28]
[E 10/15 15:54] [bt] pvfs2-server(PINT_state_machine_next+0xc9)
[0x441049]
[E 10/15 15:54] [bt] pvfs2-server(PINT_state_machine_continue+0x1e)
[0x440b9e]
[E 10/15 15:54] [bt] pvfs2-server(main+0xe3e) [0x41215e]
[E 10/15 15:54] [bt] /lib/libc.so.6(__libc_start_main+0xe6)
[0x2b3b3d1811a6]
[E 10/15 15:54] [bt] pvfs2-server [0x40f7d9]
[E 10/15 15:54] Server Response 0x2aaaac032690 is of type:
PVFS_SERV_SMALL_IO
[E 10/15 15:54] FIXME: unimplemented resp type to print
----- cut here -----
I can see the above errors on only three of the five I/O servers:
precisely those servers which are using a 'logical volume' from the same
'virtual disk' in the SAN. The other two I/O servers use a
different virtual disk and show no errors.
Is it possible that the error reported by pvfs is actually a SAN/FC-related
error indicating that the SAN is too heavily loaded? That would explain
why only three servers are having problems...
Is there any chance these errors can harm the data being written?
If increasing a timeout is the solution, which of the following
parameters should I modify: ServerJobBMITimeoutSecs,
ServerJobFlowTimeoutSecs, ClientJobBMITimeoutSecs,
ClientJobFlowTimeoutSecs, ClientRetryLimit, or
ClientRetryDelayMilliSecs?
Thank you very much,
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers