I think there must be a problem with the client-side (kernel) aio
support in PVFS. There is a related bug report from a while back:
http://www.beowulf-underground.org/pipermail/pvfs2-users/2010-February/003045.html
The libaio library described in that bug report uses the io_submit()
system call as well.
-Phil
On 06/17/2011 08:20 AM, Vincenzo Gulisano wrote:
It's an Ubuntu server, kernel 2.6.24-24-server, 64-bit.
PVFS2 is 2.8.2.
I have one client that loops calling syscall(SYS_io_submit,...
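Roughly, the submit loop looks like the sketch below (a simplified
illustration, not the exact test program; the mount point and file name
are placeholders):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

#define CHUNK (500 * 1024)            /* 500K chunks, as in the test */

int main(void)
{
    aio_context_t ctx = 0;
    char *buf;
    long long off = 0;
    int fd;

    if (posix_memalign((void **)&buf, 4096, CHUNK))
        return 1;
    memset(buf, 'x', CHUNK);

    /* placeholder path: the real test writes into the PVFS2 mount */
    fd = open("/mnt/pvfs2/testfile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (syscall(SYS_io_setup, 8, &ctx) < 0) { perror("io_setup"); return 1; }

    for (;;) {
        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        struct io_event ev;

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes     = fd;
        cb.aio_lio_opcode = IOCB_CMD_PWRITE;
        cb.aio_buf        = (unsigned long)buf;
        cb.aio_nbytes     = CHUNK;
        cb.aio_offset     = off;

        /* submit one write and wait for its completion event */
        if (syscall(SYS_io_submit, ctx, 1, cbs) != 1) { perror("io_submit"); break; }
        if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) != 1) { perror("io_getevents"); break; }
        if ((long)ev.res < 0) { fprintf(stderr, "write failed: %ld\n", (long)ev.res); break; }

        off += CHUNK;
    }

    syscall(SYS_io_destroy, ctx);
    close(fd);
    free(buf);
    return 0;
}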
On 17 June 2011 14:02, Michael Moore <[email protected]> wrote:
What version of OrangeFS/PVFS and what distro/kernel version are
used in the setup? To re-create it, is it just a stream of simple
write() calls from a single client, or something more involved?
Thanks,
Michael
On Fri, Jun 17, 2011 at 7:43 AM, Vincenzo Gulisano
<[email protected] <mailto:[email protected]>>
wrote:
Thanks Michael
I've tried setting alt-aio as TroveMethod and the problem is
still there.
Some logs:
Client (blade39) says:
[E 13:36:19.590763] server: tcp://blade60:3334
[E 13:36:19.591006] io_process_context_recv (op_status): No
such file or directory
[E 13:36:19.591018] server: tcp://blade61:3334
[E 13:36:19.768105] io_process_context_recv (op_status): No
such file or directory
Servers:
blade58:
[E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
[E 06/17 13:37] handle_io_error: flow proto error cleanup
started on 0x7f5cac004370: Connection reset by peer
[E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370
canceled 0 operations, will clean up.
[E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370
error cleanup finished: Connection reset by peer
[E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
[E 06/17 13:37] handle_io_error: flow proto error cleanup
started on 0x7f5cac0ee8f0: Connection reset by peer
[E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0
canceled 0 operations, will clean up.
[E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0
error cleanup finished: Connection reset by peer
blade59:
[E 06/17 13:37] trove_write_callback_fn: I/O error occurred
[E 06/17 13:37] handle_io_error: flow proto error cleanup
started on 0x799410: Broken pipe
[E 06/17 13:37] handle_io_error: flow proto 0x799410 canceled
0 operations, will clean up.
[E 06/17 13:37] handle_io_error: flow proto 0x799410 error
cleanup finished: Broken pipe
blade60:
[E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
[E 06/17 13:37] handle_io_error: flow proto error cleanup
started on 0x7fb0a012bed0: Connection reset by peer
[E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0
canceled 0 operations, will clean up.
[E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0
error cleanup finished: Connection reset by peer
blade61:
[E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
[E 06/17 13:37] handle_io_error: flow proto error cleanup
started on 0x76b5a0: Broken pipe
[E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 canceled
0 operations, will clean up.
[E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 error
cleanup finished: Broken pipe
[E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
[E 06/17 13:37] handle_io_error: flow proto error cleanup
started on 0x778e00: Broken pipe
[E 06/17 13:37] handle_io_error: flow proto 0x778e00 canceled
0 operations, will clean up.
[E 06/17 13:37] handle_io_error: flow proto 0x778e00 error
cleanup finished: Broken pipe
[E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
[E 06/17 13:37] handle_io_error: flow proto error cleanup
started on 0x76b5a0: Broken pipe
[E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 canceled
0 operations, will clean up.
[E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 error
cleanup finished: Broken pipe
[E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
[E 06/17 13:37] handle_io_error: flow proto error cleanup
started on 0x778e00: Broken pipe
[E 06/17 13:37] handle_io_error: flow proto 0x778e00 canceled
0 operations, will clean up.
[E 06/17 13:37] handle_io_error: flow proto 0x778e00 error
cleanup finished: Broken pipe
Vincenzo
On 17 June 2011 13:29, Michael Moore <[email protected]> wrote:
Hi Vincenzo,
This sounds similar to an issue just reported by Benjamin
Seevers here on the developers list:
http://www.beowulf-underground.org/pipermail/pvfs2-developers/2011-June/004732.html
Based on his experience with that issue, the corruption no
longer occurs if you switch from directio to alt-aio. Could
you try switching from directio to alt-aio in your
configuration, to help isolate whether this is a similar or a
different issue? If that doesn't resolve it, could you provide
what errors, if any, you see on the client when it fails, and
what errors, if any, appear in the pvfs2-server logs?
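For reference, the change amounts to editing the TroveMethod line
in the StorageHints section of your server configuration, something
like this (the rest of the section stays as in your current config):

<StorageHints>
    TroveSyncMeta no
    TroveSyncData no
    TroveMethod alt-aio
</StorageHints>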
Thanks,
Michael
On Fri, Jun 17, 2011 at 6:48 AM, Vincenzo Gulisano
<[email protected]
<mailto:[email protected]>> wrote:
Hi,
I'm using the following setup:
4 machines used as I/O servers
10 machines used as I/O clients
The configuration file is the following:
<Defaults>
UnexpectedRequests 50
EventLogging none
EnableTracing no
LogStamp datetime
BMIModules bmi_tcp
FlowModules flowproto_multiqueue
PerfUpdateInterval 1000
ServerJobBMITimeoutSecs 30
ServerJobFlowTimeoutSecs 30
ClientJobBMITimeoutSecs 300
ClientJobFlowTimeoutSecs 300
ClientRetryLimit 5
ClientRetryDelayMilliSecs 2000
PrecreateBatchSize 512
PrecreateLowThreshold 256
TCPBufferSend 524288
TCPBufferReceive 524288
StorageSpace /local/vincenzo/pvfs2-storage-space
LogFile /tmp/pvfs2-server.log
</Defaults>
<Aliases>
Alias blade58 tcp://blade58:3334
Alias blade59 tcp://blade59:3334
Alias blade60 tcp://blade60:3334
Alias blade61 tcp://blade61:3334
</Aliases>
<Filesystem>
Name pvfs2-fs
ID 1615492168
RootHandle 1048576
FileStuffing yes
<MetaHandleRanges>
Range blade58 3-1152921504606846977
Range blade59 1152921504606846978-2305843009213693952
Range blade60 2305843009213693953-3458764513820540927
Range blade61 3458764513820540928-4611686018427387902
</MetaHandleRanges>
<DataHandleRanges>
Range blade58 4611686018427387903-5764607523034234877
Range blade59 5764607523034234878-6917529027641081852
Range blade60 6917529027641081853-8070450532247928827
Range blade61 8070450532247928828-9223372036854775802
</DataHandleRanges>
<StorageHints>
TroveSyncMeta no
TroveSyncData no
TroveMethod directio
</StorageHints>
</Filesystem>
I'm testing the system by continuously writing chunks of 500K
from one client machine. After a few seconds, the client is no
longer able to write. Checking the file system manually, I can
see my file (running ls), but it appears to be corrupted (no
information about the file is reported, and I cannot remove
it). The only solution is to stop all clients and servers and
re-create the file system.
Thanks in advance
Vincenzo
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users