It appears the call to copy_to_user_page() (line 1516 in src/kernel/linux-2.6/pvfs2-bufmap.c) is the culprit. I don't know yet if it's a matter of incorrect usage/locking/cache management or bad arguments.
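For reference, below is a minimal sketch of the sequence a 2.6-era kernel module is generally expected to follow when it copies data into a user page pinned with get_user_pages(). This is not the pvfs2-bufmap.c code, and the helper name and arguments are made up; it is just the generic pattern (pin under mmap_sem, kmap, copy_to_user_page for cache coherency, mark dirty, release) that I'm comparing our code against. It also assumes len stays within a single page.

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <asm/cacheflush.h>

/* Hypothetical helper, not the PVFS2 function: copy len bytes from src
 * into user address user_addr of task tsk.  len must not cross a page
 * boundary in this sketch. */
static int copy_block_to_user_task(struct task_struct *tsk,
                                   unsigned long user_addr,
                                   const void *src, size_t len)
{
    struct mm_struct *mm = tsk->mm;
    struct vm_area_struct *vma;
    struct page *page;
    void *kaddr;
    int ret;

    down_read(&mm->mmap_sem);           /* required around get_user_pages() */
    ret = get_user_pages(tsk, mm, user_addr & PAGE_MASK, 1,
                         1 /* write */, 0 /* force */, &page, &vma);
    if (ret < 1) {
        up_read(&mm->mmap_sem);
        return ret < 0 ? ret : -EFAULT;
    }

    kaddr = kmap(page);                 /* map the page (handles highmem) */
    /* copy_to_user_page() does the memcpy plus whatever D-cache/I-cache
     * flushing the architecture needs; the vma must still be valid here,
     * so mmap_sem stays held until after this call. */
    copy_to_user_page(vma, page, user_addr,
                      (char *)kaddr + (user_addr & ~PAGE_MASK), src, len);
    kunmap(page);
    up_read(&mm->mmap_sem);

    set_page_dirty_lock(page);          /* we wrote to the page */
    page_cache_release(page);           /* drop the get_user_pages() reference */
    return 0;
}

If any of those steps is skipped, reordered, or done against a stale vma/page, that could plausibly explain panics that surface in unrelated code paths.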
Michael

On Mon, Jun 20, 2011 at 8:04 PM, Michael Moore <[email protected]> wrote:
> I believe it returns ENOSYS if the calls aren't defined. I agree that if we
> can't get a fix, it is better to disable it. However, the turn-around time
> on getting it turned off in a released version, and that version being in
> use, is hopefully longer than the resolution time of the issue.
>
> I've been doing some digging into the issue. First, the one time I ran the
> example code from the man page with AIO_MAXIO == 2 (two reads at once), it
> succeeded; with more than two it fails. The kernel panics are all over the
> place, in code that can't be the problem (scsi, ext3, etc.), which makes me
> think we're doing something really bad(tm). I've attached a handful of
> panics I've collected while looking at this. I think the issue is in
> pvfs_bufmap_copy_to_user_task_iovec(), which would be consistent with
> reads, not writes, causing the panics. However, I'm not sure what the issue
> is yet. Any insight or recommendations appreciated.
>
> Michael
>
> On Fri, Jun 17, 2011 at 10:23 AM, Phil Carns <[email protected]> wrote:
>> What happens if a file system simply doesn't provide the aio functions
>> (i.e., leaves aio_write, aio_read, etc. set to NULL in the file_operations
>> structure)? I wonder if the aio system calls return ENOSYS or if the
>> kernel just services them as blocking calls.
>>
>> At any rate it seems like it might be a good idea to turn off that
>> functionality until the bug is fixed so that folks don't get caught off
>> guard.
>>
>> -Phil
>>
>> On 06/17/2011 10:07 AM, Michael Moore wrote:
>> Good memory, Phil!
>>
>> Vincenzo, you are welcome to try and upgrade to OrangeFS; however, I don't
>> suspect it will do too much good. Let me get this on our list and take a
>> look at it.
>>
>> Michael
>>
>> On Fri, Jun 17, 2011 at 9:54 AM, Phil Carns <[email protected]> wrote:
>>> I think there must be a problem with the client (kernel) side aio
>>> support in PVFS. There is a related bug report from a while back:
>>>
>>> http://www.beowulf-underground.org/pipermail/pvfs2-users/2010-February/003045.html
>>>
>>> The libaio library described in that bug report uses the io_submit()
>>> system call as well.
>>>
>>> -Phil
>>>
>>> On 06/17/2011 08:20 AM, Vincenzo Gulisano wrote:
>>> It's an Ubuntu server, 2.6.24-24-server, 64 bits.
>>> pvfs2 is 2.8.2.
>>>
>>> I have 1 client that loops calling syscall(SYS_io_submit,...
>>>
>>> On 17 June 2011 14:02, Michael Moore <[email protected]> wrote:
>>>> What version of OrangeFS/PVFS and what distro/kernel version is used in
>>>> the setup? To re-create it, is it just a stream of simple write() calls
>>>> from a single client, or something more involved?
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>> On Fri, Jun 17, 2011 at 7:43 AM, Vincenzo Gulisano
>>>> <[email protected]> wrote:
>>>>> Thanks Michael
>>>>>
>>>>> I've tried setting alt-aio as TroveMethod and the problem is still
>>>>> there.
>>>>>
>>>>> Some logs:
>>>>>
>>>>> Client (blade39) says:
>>>>>
>>>>> [E 13:36:19.590763] server: tcp://blade60:3334
>>>>> [E 13:36:19.591006] io_process_context_recv (op_status): No such file or directory
>>>>> [E 13:36:19.591018] server: tcp://blade61:3334
>>>>> [E 13:36:19.768105] io_process_context_recv (op_status): No such file or directory
>>>>>
>>>>> Servers:
>>>>>
>>>>> blade58:
>>>>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x7f5cac004370: Connection reset by peer
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370 canceled 0 operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370 error cleanup finished: Connection reset by peer
>>>>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x7f5cac0ee8f0: Connection reset by peer
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0 canceled 0 operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0 error cleanup finished: Connection reset by peer
>>>>>
>>>>> blade59:
>>>>> [E 06/17 13:37] trove_write_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x799410: Broken pipe
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x799410 canceled 0 operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x799410 error cleanup finished: Broken pipe
>>>>>
>>>>> blade60:
>>>>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x7fb0a012bed0: Connection reset by peer
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0 canceled 0 operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0 error cleanup finished: Connection reset by peer
>>>>>
>>>>> blade61:
>>>>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x76b5a0: Broken pipe
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 canceled 0 operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 error cleanup finished: Broken pipe
>>>>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x778e00: Broken pipe
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 canceled 0 operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 error cleanup finished: Broken pipe
>>>>>
>>>>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x76b5a0: Broken pipe
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 canceled 0 operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 error cleanup finished: Broken pipe
>>>>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on 0x778e00: Broken pipe
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 canceled 0 operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 error cleanup finished: Broken pipe
>>>>>
>>>>> Vincenzo
>>>>>
>>>>> On 17 June 2011 13:29, Michael Moore <[email protected]> wrote:
>>>>>> Hi Vincenzo,
>>>>>>
>>>>>> This sounds similar to an issue just reported by Benjamin Seevers here
>>>>>> on the developers list:
>>>>>>
>>>>>> http://www.beowulf-underground.org/pipermail/pvfs2-developers/2011-June/004732.html
>>>>>>
>>>>>> Based on his experience with the issue, the corruption no longer occurs
>>>>>> if you switch to alt-aio instead of directio. Could you try switching
>>>>>> from directio to alt-aio in your configuration to help isolate whether
>>>>>> this is a similar or different issue? If that doesn't resolve the
>>>>>> issue, could you provide what errors, if any, you see on the client
>>>>>> when it fails and what errors, if any, appear in the pvfs2-server logs?
>>>>>>
>>>>>> Thanks,
>>>>>> Michael
>>>>>>
>>>>>> On Fri, Jun 17, 2011 at 6:48 AM, Vincenzo Gulisano
>>>>>> <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>> I'm using the following setup:
>>>>>>> 4 machines used as I/O servers
>>>>>>> 10 machines used as I/O clients
>>>>>>>
>>>>>>> The configuration file is the following:
>>>>>>>
>>>>>>> <Defaults>
>>>>>>> UnexpectedRequests 50
>>>>>>> EventLogging none
>>>>>>> EnableTracing no
>>>>>>> LogStamp datetime
>>>>>>> BMIModules bmi_tcp
>>>>>>> FlowModules flowproto_multiqueue
>>>>>>> PerfUpdateInterval 1000
>>>>>>> ServerJobBMITimeoutSecs 30
>>>>>>> ServerJobFlowTimeoutSecs 30
>>>>>>> ClientJobBMITimeoutSecs 300
>>>>>>> ClientJobFlowTimeoutSecs 300
>>>>>>> ClientRetryLimit 5
>>>>>>> ClientRetryDelayMilliSecs 2000
>>>>>>> PrecreateBatchSize 512
>>>>>>> PrecreateLowThreshold 256
>>>>>>> TCPBufferSend 524288
>>>>>>> TCPBufferReceive 524288
>>>>>>> StorageSpace /local/vincenzo/pvfs2-storage-space
>>>>>>> LogFile /tmp/pvfs2-server.log
>>>>>>> </Defaults>
>>>>>>>
>>>>>>> <Aliases>
>>>>>>> Alias blade58 tcp://blade58:3334
>>>>>>> Alias blade59 tcp://blade59:3334
>>>>>>> Alias blade60 tcp://blade60:3334
>>>>>>> Alias blade61 tcp://blade61:3334
>>>>>>> </Aliases>
>>>>>>>
>>>>>>> <Filesystem>
>>>>>>> Name pvfs2-fs
>>>>>>> ID 1615492168
>>>>>>> RootHandle 1048576
>>>>>>> FileStuffing yes
>>>>>>> <MetaHandleRanges>
>>>>>>> Range blade58 3-1152921504606846977
>>>>>>> Range blade59 1152921504606846978-2305843009213693952
>>>>>>> Range blade60 2305843009213693953-3458764513820540927
>>>>>>> Range blade61 3458764513820540928-4611686018427387902
>>>>>>> </MetaHandleRanges>
>>>>>>> <DataHandleRanges>
>>>>>>> Range blade58 4611686018427387903-5764607523034234877
>>>>>>> Range blade59 5764607523034234878-6917529027641081852
>>>>>>> Range blade60 6917529027641081853-8070450532247928827
>>>>>>> Range blade61 8070450532247928828-9223372036854775802
>>>>>>> </DataHandleRanges>
>>>>>>> <StorageHints>
>>>>>>> TroveSyncMeta no
>>>>>>> TroveSyncData no
>>>>>>> TroveMethod directio
>>>>>>> </StorageHints>
>>>>>>> </Filesystem>
>>>>>>>
>>>>>>> I'm testing the system by continuously writing 500K chunks from 1
>>>>>>> client machine. After a few seconds, the client is no longer able to
>>>>>>> write. Checking the file system manually, I can see my file (running
>>>>>>> ls) and it seems to be corrupted (no information about the file is
>>>>>>> given and I cannot remove the file). The only solution is to stop all
>>>>>>> clients and servers and re-create the file system.
>>>>>>>
>>>>>>> Thanks in advance
>>>>>>>
>>>>>>> Vincenzo
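In case it helps anyone trying to reproduce this, a test along the lines described above (a loop over syscall(SYS_io_submit, ...) with more than two reads in flight) would look roughly like the sketch below. This is only an illustration using the raw syscalls, not Vincenzo's program or the man-page example; the file name, request count, and sizes are arbitrary.

/* Build: gcc -o aio_repro aio_repro.c
 * Run:   ./aio_repro <file on the PVFS2 mount>                      */
#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_IOS  4            /* anything greater than 2 reportedly fails */
#define IO_SIZE (64 * 1024)  /* arbitrary request size */

int main(int argc, char **argv)
{
    aio_context_t ctx = 0;
    struct iocb iocbs[NR_IOS];
    struct iocb *iocbps[NR_IOS];
    struct io_event events[NR_IOS];
    char *bufs[NR_IOS];
    long ret;
    int fd, i;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    if (syscall(SYS_io_setup, NR_IOS, &ctx) < 0) { perror("io_setup"); return 1; }

    /* queue NR_IOS concurrent reads at different offsets */
    for (i = 0; i < NR_IOS; i++) {
        bufs[i] = malloc(IO_SIZE);
        memset(&iocbs[i], 0, sizeof(iocbs[i]));
        iocbs[i].aio_fildes     = fd;
        iocbs[i].aio_lio_opcode = IOCB_CMD_PREAD;
        iocbs[i].aio_buf        = (unsigned long)bufs[i];
        iocbs[i].aio_nbytes     = IO_SIZE;
        iocbs[i].aio_offset     = (long long)i * IO_SIZE;
        iocbps[i] = &iocbs[i];
    }

    ret = syscall(SYS_io_submit, ctx, NR_IOS, iocbps);
    if (ret != NR_IOS) { perror("io_submit"); return 1; }

    /* wait for all completions */
    ret = syscall(SYS_io_getevents, ctx, NR_IOS, NR_IOS, events, NULL);
    if (ret < 0) perror("io_getevents");

    syscall(SYS_io_destroy, ctx);
    close(fd);
    return 0;
}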
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
