It's an Ubuntu server (kernel 2.6.24-24-server, 64-bit); PVFS2 is 2.8.2. I have one client that loops calling syscall(SYS_io_submit,...
On 17 June 2011 14:02, Michael Moore <[email protected]> wrote:

> What version of OrangeFS/PVFS and what distro/kernel version is used in
> the setup? To re-create it, just a stream of simple write() calls from a
> single client or something more involved?
>
> Thanks,
> Michael
>
> On Fri, Jun 17, 2011 at 7:43 AM, Vincenzo Gulisano <
> [email protected]> wrote:
>
>> Thanks Michael
>>
>> I've tried setting alt-aio as TroveMethod and the problem is still there.
>>
>> Some logs:
>>
>> Client (blade39) says:
>>
>> [E 13:36:19.590763] server: tcp://blade60:3334
>> [E 13:36:19.591006] io_process_context_recv (op_status): No such file or directory
>> [E 13:36:19.591018] server: tcp://blade61:3334
>> [E 13:36:19.768105] io_process_context_recv (op_status): No such file or directory
>>
>> Servers:
>>
>> blade58:
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x7f5cac004370: Connection reset by peer
>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370 canceled 0 operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370 error cleanup finished: Connection reset by peer
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x7f5cac0ee8f0: Connection reset by peer
>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0 canceled 0 operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0 error cleanup finished: Connection reset by peer
>>
>> blade59:
>> [E 06/17 13:37] trove_write_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x799410: Broken pipe
>> [E 06/17 13:37] handle_io_error: flow proto 0x799410 canceled 0 operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x799410 error cleanup finished: Broken pipe
>>
>> blade60:
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x7fb0a012bed0: Connection reset by peer
>> [E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0 canceled 0 operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0 error cleanup finished: Connection reset by peer
>>
>> blade61:
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x76b5a0: Broken pipe
>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 canceled 0 operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 error cleanup finished: Broken pipe
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x778e00: Broken pipe
>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 canceled 0 operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 error cleanup finished: Broken pipe
>>
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x76b5a0: Broken pipe
>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 canceled 0 operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 error cleanup finished: Broken pipe
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x778e00: Broken pipe
>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 canceled 0 operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 error cleanup finished: Broken pipe
>>
>> Vincenzo
>>
>> On 17 June 2011 13:29, Michael Moore <[email protected]> wrote:
>>
>>> Hi Vincenzo,
>>>
>>> This sounds similar to an issue just reported by Benjamin Seevers here
>>> on the developers list:
>>>
>>> http://www.beowulf-underground.org/pipermail/pvfs2-developers/2011-June/004732.html
>>>
>>> Based on his experience with the issue, if you switch to alt-aio instead
>>> of directio the corruption no longer occurs. Could you try switching from
>>> directio to alt-aio in your configuration to help isolate whether this is
>>> a similar or different issue? If that doesn't resolve the issue, could you
>>> provide what errors, if any, you see on the client when it fails and what
>>> errors, if any, appear in the pvfs2-server logs?
>>>
>>> Thanks,
>>> Michael
>>>
>>> On Fri, Jun 17, 2011 at 6:48 AM, Vincenzo Gulisano <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>> I'm using the following setup:
>>>> 4 machines used as I/O servers
>>>> 10 machines used as I/O clients
>>>>
>>>> The configuration file is the following:
>>>>
>>>> <Defaults>
>>>>     UnexpectedRequests 50
>>>>     EventLogging none
>>>>     EnableTracing no
>>>>     LogStamp datetime
>>>>     BMIModules bmi_tcp
>>>>     FlowModules flowproto_multiqueue
>>>>     PerfUpdateInterval 1000
>>>>     ServerJobBMITimeoutSecs 30
>>>>     ServerJobFlowTimeoutSecs 30
>>>>     ClientJobBMITimeoutSecs 300
>>>>     ClientJobFlowTimeoutSecs 300
>>>>     ClientRetryLimit 5
>>>>     ClientRetryDelayMilliSecs 2000
>>>>     PrecreateBatchSize 512
>>>>     PrecreateLowThreshold 256
>>>>     TCPBufferSend 524288
>>>>     TCPBufferReceive 524288
>>>>     StorageSpace /local/vincenzo/pvfs2-storage-space
>>>>     LogFile /tmp/pvfs2-server.log
>>>> </Defaults>
>>>>
>>>> <Aliases>
>>>>     Alias blade58 tcp://blade58:3334
>>>>     Alias blade59 tcp://blade59:3334
>>>>     Alias blade60 tcp://blade60:3334
>>>>     Alias blade61 tcp://blade61:3334
>>>> </Aliases>
>>>>
>>>> <Filesystem>
>>>>     Name pvfs2-fs
>>>>     ID 1615492168
>>>>     RootHandle 1048576
>>>>     FileStuffing yes
>>>>     <MetaHandleRanges>
>>>>         Range blade58 3-1152921504606846977
>>>>         Range blade59 1152921504606846978-2305843009213693952
>>>>         Range blade60 2305843009213693953-3458764513820540927
>>>>         Range blade61 3458764513820540928-4611686018427387902
>>>>     </MetaHandleRanges>
>>>>     <DataHandleRanges>
>>>>         Range blade58 4611686018427387903-5764607523034234877
>>>>         Range blade59 5764607523034234878-6917529027641081852
>>>>         Range blade60 6917529027641081853-8070450532247928827
>>>>         Range blade61 8070450532247928828-9223372036854775802
>>>>     </DataHandleRanges>
>>>>     <StorageHints>
>>>>         TroveSyncMeta no
>>>>         TroveSyncData no
>>>>         TroveMethod directio
>>>>     </StorageHints>
>>>> </Filesystem>
>>>>
>>>> I'm testing the system by writing continuously from one client machine
>>>> in chunks of 500K. After a few seconds, the client is no longer able to
>>>> write. Checking the file system manually, I can see my file (running ls),
>>>> and it seems to be corrupted (no information about the file is given and
>>>> I cannot remove the file). The only solution is to stop all clients and
>>>> servers and re-create the file system.
>>>>
>>>> Thanks in advance
>>>>
>>>> Vincenzo
>>>>
>>>> _______________________________________________
>>>> Pvfs2-users mailing list
>>>> [email protected]
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
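For completeness: the alt-aio test mentioned at the top of this thread amounts to changing the TroveMethod line in the StorageHints block of the config quoted above, presumably along these lines (the rest of the block unchanged):

```
<StorageHints>
    TroveSyncMeta no
    TroveSyncData no
    TroveMethod alt-aio
</StorageHints>
```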
