What version of OrangeFS/PVFS, and what distro/kernel version, are you using in this setup? To re-create it, is it just a stream of simple write() calls from a single client, or something more involved?
Thanks,
Michael

On Fri, Jun 17, 2011 at 7:43 AM, Vincenzo Gulisano <[email protected]> wrote:

> Thanks Michael
>
> I've tried setting alt-aio as TroveMethod and the problem is still there.
>
> Some logs:
>
> Client (blade39) says:
>
> [E 13:36:19.590763] server: tcp://blade60:3334
> [E 13:36:19.591006] io_process_context_recv (op_status): No such file or directory
> [E 13:36:19.591018] server: tcp://blade61:3334
> [E 13:36:19.768105] io_process_context_recv (op_status): No such file or directory
>
> Servers:
>
> blade58:
> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x7f5cac004370: Connection reset by peer
> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370 canceled 0 operations, will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370 error cleanup finished: Connection reset by peer
> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x7f5cac0ee8f0: Connection reset by peer
> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0 canceled 0 operations, will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0 error cleanup finished: Connection reset by peer
>
> blade59:
> [E 06/17 13:37] trove_write_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x799410: Broken pipe
> [E 06/17 13:37] handle_io_error: flow proto 0x799410 canceled 0 operations, will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x799410 error cleanup finished: Broken pipe
>
> blade60:
> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x7fb0a012bed0: Connection reset by peer
> [E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0 canceled 0 operations, will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0 error cleanup finished: Connection reset by peer
>
> blade61:
> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x76b5a0: Broken pipe
> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 canceled 0 operations, will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 error cleanup finished: Broken pipe
> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x778e00: Broken pipe
> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 canceled 0 operations, will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 error cleanup finished: Broken pipe
>
> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x76b5a0: Broken pipe
> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 canceled 0 operations, will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 error cleanup finished: Broken pipe
> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x778e00: Broken pipe
> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 canceled 0 operations, will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 error cleanup finished: Broken pipe
>
> Vincenzo
>
> On 17 June 2011 13:29, Michael Moore <[email protected]> wrote:
>
>> Hi Vincenzo,
>>
>> This sounds similar to an issue just reported by Benjamin Seevers here on the developers list:
>>
>> http://www.beowulf-underground.org/pipermail/pvfs2-developers/2011-June/004732.html
>>
>> Based on his experience with the issue, if you switch to alt-aio instead of directio the corruption no longer occurs. Could you try switching from directio to alt-aio in your configuration to help isolate whether this is a similar or different issue? If that doesn't resolve the issue, could you provide what errors, if any, you see on the client when it fails, and what errors, if any, appear in the pvfs2-server logs?
>>
>> Thanks,
>> Michael
>>
>> On Fri, Jun 17, 2011 at 6:48 AM, Vincenzo Gulisano <[email protected]> wrote:
>>
>>> Hi,
>>> I'm using the following setup:
>>> 4 machines used as I/O servers
>>> 10 machines used as I/O clients
>>>
>>> The configuration file is the following:
>>>
>>> <Defaults>
>>> UnexpectedRequests 50
>>> EventLogging none
>>> EnableTracing no
>>> LogStamp datetime
>>> BMIModules bmi_tcp
>>> FlowModules flowproto_multiqueue
>>> PerfUpdateInterval 1000
>>> ServerJobBMITimeoutSecs 30
>>> ServerJobFlowTimeoutSecs 30
>>> ClientJobBMITimeoutSecs 300
>>> ClientJobFlowTimeoutSecs 300
>>> ClientRetryLimit 5
>>> ClientRetryDelayMilliSecs 2000
>>> PrecreateBatchSize 512
>>> PrecreateLowThreshold 256
>>> TCPBufferSend 524288
>>> TCPBufferReceive 524288
>>> StorageSpace /local/vincenzo/pvfs2-storage-space
>>> LogFile /tmp/pvfs2-server.log
>>> </Defaults>
>>>
>>> <Aliases>
>>> Alias blade58 tcp://blade58:3334
>>> Alias blade59 tcp://blade59:3334
>>> Alias blade60 tcp://blade60:3334
>>> Alias blade61 tcp://blade61:3334
>>> </Aliases>
>>>
>>> <Filesystem>
>>> Name pvfs2-fs
>>> ID 1615492168
>>> RootHandle 1048576
>>> FileStuffing yes
>>> <MetaHandleRanges>
>>> Range blade58 3-1152921504606846977
>>> Range blade59 1152921504606846978-2305843009213693952
>>> Range blade60 2305843009213693953-3458764513820540927
>>> Range blade61 3458764513820540928-4611686018427387902
>>> </MetaHandleRanges>
>>> <DataHandleRanges>
>>> Range blade58 4611686018427387903-5764607523034234877
>>> Range blade59 5764607523034234878-6917529027641081852
>>> Range blade60 6917529027641081853-8070450532247928827
>>> Range blade61 8070450532247928828-9223372036854775802
>>> </DataHandleRanges>
>>> <StorageHints>
>>> TroveSyncMeta no
>>> TroveSyncData no
>>> TroveMethod directio
>>> </StorageHints>
>>> </Filesystem>
>>>
>>> I'm testing the system by writing continuously from 1 client machine in chunks of 500K. After a few seconds, the client is no longer able to write. Checking the file system manually, I can see my file (running ls), but it appears to be corrupted (no information about the file is given and I cannot remove it). The only solution is to stop all clients/servers and re-create the file system.
>>>
>>> Thanks in advance
>>>
>>> Vincenzo
>>>
>>> _______________________________________________
>>> Pvfs2-users mailing list
>>> [email protected]
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>
>>
>

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
