It appears the call to copy_to_user_page() (line 1516 in
src/kernel/linux-2.6/pvfs2-bufmap.c) is the culprit. I don't know yet if
it's a matter of incorrect usage/locking/cache management or bad arguments.
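
For context, here is a rough sketch of the pinning and locking pattern that
copy_to_user_page() normally assumes on 2.6-era kernels, modeled on
access_process_vm() in mm/memory.c. This is NOT the pvfs2-bufmap.c code; the
function and variable names are made up. It only illustrates the arguments and
surrounding state the call expects (a pinned page, a kmap'd address, the vma
covering vaddr, a length confined to one page), which is roughly the checklist
for the usage/locking/arguments question above.

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <asm/cacheflush.h>

/* Sketch only -- not the pvfs2 code.  Copy len bytes from a kernel
 * buffer into another task's address space at vaddr, the way
 * access_process_vm() does it. */
static int sketch_copy_to_task(struct task_struct *tsk, struct mm_struct *mm,
                               unsigned long vaddr, const void *src, int len)
{
        struct vm_area_struct *vma;
        struct page *page;
        void *maddr;
        int ret;

        down_read(&mm->mmap_sem);
        /* pin exactly one writable page backing vaddr */
        ret = get_user_pages(tsk, mm, vaddr, 1, 1, 0, &page, &vma);
        if (ret <= 0) {
                up_read(&mm->mmap_sem);
                return -EFAULT;
        }

        maddr = kmap(page);
        /* caller must ensure len <= PAGE_SIZE - (vaddr & ~PAGE_MASK) */
        copy_to_user_page(vma, page, vaddr,
                          maddr + (vaddr & ~PAGE_MASK), src, len);
        set_page_dirty_lock(page);
        kunmap(page);
        page_cache_release(page);
        up_read(&mm->mmap_sem);
        return 0;
}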

Michael

On Mon, Jun 20, 2011 at 8:04 PM, Michael Moore <[email protected]> wrote:

> I believe it returns ENOSYS if the calls aren't defined. I agree that if we
> can't get a fix, it is better to disable it. However, the turn-around time for
> disabling it in a released version and getting that version into use is
> hopefully longer than the time it will take to resolve this issue.
>
> I've been doing some digging into the issue. First, running the example code
> from the man page with AIO_MAXIO == 2 (two reads at once) succeeds; with more
> than two reads it fails. The kernel panics are all over the place, in code
> that can't be the problem (scsi, ext3, etc.), which makes me think we're doing
> something really bad(tm). I've attached a handful of panics I've collected
> while looking at this. I think the issue is in
> pvfs_bufmap_copy_to_user_task_iovec(), which is consistent with reads, not
> writes, causing the panics. However, I'm not sure what the issue is yet. Any
> insight or recommendations would be appreciated.
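
For anyone trying to reproduce this independently, here is a minimal sketch of
the kind of test described above (not the actual man-page example): it submits
several concurrent reads against a file on the PVFS2 mount via the raw
io_submit()/io_getevents() syscalls and waits for them all. The mount path,
read count, and buffer size are placeholders.

/* Sketch of an N-way concurrent read test over raw aio syscalls. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

#define NREADS  4            /* > 2 reportedly triggers the panics */
#define BUFSIZE (64 * 1024)

static long my_io_setup(unsigned nr, aio_context_t *ctx)
{ return syscall(SYS_io_setup, nr, ctx); }
static long my_io_submit(aio_context_t ctx, long n, struct iocb **iocbs)
{ return syscall(SYS_io_submit, ctx, n, iocbs); }
static long my_io_getevents(aio_context_t ctx, long min_nr, long nr,
                            struct io_event *ev, struct timespec *tmo)
{ return syscall(SYS_io_getevents, ctx, min_nr, nr, ev, tmo); }

int main(void)
{
    aio_context_t ctx = 0;
    struct iocb cb[NREADS], *cbp[NREADS];
    struct io_event ev[NREADS];
    int i, fd = open("/mnt/pvfs2/testfile", O_RDONLY);  /* placeholder path */

    if (fd < 0 || my_io_setup(NREADS, &ctx) < 0) {
        perror("setup");
        return 1;
    }
    for (i = 0; i < NREADS; i++) {
        memset(&cb[i], 0, sizeof(cb[i]));
        cb[i].aio_fildes     = fd;
        cb[i].aio_lio_opcode = IOCB_CMD_PREAD;
        cb[i].aio_buf        = (unsigned long)malloc(BUFSIZE);
        cb[i].aio_nbytes     = BUFSIZE;
        cb[i].aio_offset     = (long long)i * BUFSIZE;
        cbp[i] = &cb[i];
    }
    if (my_io_submit(ctx, NREADS, cbp) != NREADS) {
        perror("io_submit");
        return 1;
    }
    if (my_io_getevents(ctx, NREADS, NREADS, ev, NULL) != NREADS) {
        perror("io_getevents");
        return 1;
    }
    for (i = 0; i < NREADS; i++)
        printf("read %d: res=%lld\n", i, (long long)ev[i].res);
    return 0;
}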
>
> Michael
>
>
> On Fri, Jun 17, 2011 at 10:23 AM, Phil Carns <[email protected]> wrote:
>
>>
>> What happens if a file system simply doesn't provide the aio functions
>> (i.e., leaves aio_write, aio_read, etc. set to NULL in the file_operations
>> structure)?  I wonder if the aio system calls return ENOSYS or if the kernel
>> just services them as blocking calls.
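
For illustration only, and not the actual pvfs2 kernel module code: "leaving
the aio entries unset" would look roughly like the hypothetical 2.6-era
file_operations table below, where .aio_read/.aio_write default to NULL, so
io_submit() against such a file exercises whatever fallback the VFS/aio core
provides instead of the module's own aio path. All names here are placeholders.

#include <linux/fs.h>

/* Hypothetical example -- names are placeholders, not pvfs2 symbols. */
static ssize_t example_read(struct file *f, char __user *buf,
                            size_t count, loff_t *ppos)
{
        return 0;       /* placeholder blocking read */
}

static ssize_t example_write(struct file *f, const char __user *buf,
                             size_t count, loff_t *ppos)
{
        return count;   /* placeholder blocking write */
}

static const struct file_operations example_fops = {
        .read  = example_read,
        .write = example_write,
        /* .aio_read and .aio_write intentionally omitted, i.e. NULL */
};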
>>
>> At any rate it seems like it might be a good idea to turn off that
>> functionality until the bug is fixed so that folks don't get caught off
>> guard.
>>
>> -Phil
>>
>>
>> On 06/17/2011 10:07 AM, Michael Moore wrote:
>>
>> Good memory, Phil!
>>
>> Vincenzo, you are welcome to try upgrading to OrangeFS; however, I don't
>> expect it will do much good. Let me get this on our list and take a
>> look at it.
>>
>> Michael
>>
>>  On Fri, Jun 17, 2011 at 9:54 AM, Phil Carns <[email protected]> wrote:
>>
>>>  I think there must be a problem with the client (kernel) side aio
>>> support in PVFS.  There is a related bug report from a while back:
>>>
>>>
>>> http://www.beowulf-underground.org/pipermail/pvfs2-users/2010-February/003045.html
>>>
>>> The libaio library described in that bug report uses the io_submit()
>>> system call as well.
>>>
>>> -Phil
>>>
>>>
>>> On 06/17/2011 08:20 AM, Vincenzo Gulisano wrote:
>>>
>>> It's an Ubuntu server, kernel 2.6.24-24-server, 64-bit.
>>> pvfs2 is 2.8.2.
>>>
>>> I have one client that loops calling syscall(SYS_io_submit,...
>>>
>>>
>>> On 17 June 2011 14:02, Michael Moore <[email protected]> wrote:
>>>
>>>> What version of OrangeFS/PVFS and what distro/kernel version are used in
>>>> the setup? To re-create it, is it just a stream of simple write() calls from
>>>> a single client, or something more involved?
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>>
>>>>  On Fri, Jun 17, 2011 at 7:43 AM, Vincenzo Gulisano <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks Michael
>>>>>
>>>>>  I've tried setting alt-aio as TroveMethod and the problem is still
>>>>> there.
>>>>>
>>>>>  Some logs:
>>>>>
>>>>>  Client (blade39) says:
>>>>>
>>>>>   [E 13:36:19.590763] server: tcp://blade60:3334
>>>>> [E 13:36:19.591006] io_process_context_recv (op_status): No such file
>>>>> or directory
>>>>> [E 13:36:19.591018] server: tcp://blade61:3334
>>>>> [E 13:36:19.768105] io_process_context_recv (op_status): No such file
>>>>> or directory
>>>>>
>>>>>  Servers:
>>>>>
>>>>>  blade58:
>>>>>  [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>>>>> 0x7f5cac004370: Connection reset by peer
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370 canceled 0
>>>>> operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370 error
>>>>> cleanup finished: Connection reset by peer
>>>>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>>>>> 0x7f5cac0ee8f0: Connection reset by peer
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0 canceled 0
>>>>> operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0 error
>>>>> cleanup finished: Connection reset by peer
>>>>>
>>>>>  blade59:
>>>>>  [E 06/17 13:37] trove_write_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>>>>> 0x799410: Broken pipe
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x799410 canceled 0
>>>>> operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x799410 error cleanup
>>>>> finished: Broken pipe
>>>>>
>>>>>  blade60:
>>>>>  [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>>>>> 0x7fb0a012bed0: Connection reset by peer
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0 canceled 0
>>>>> operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0 error
>>>>> cleanup finished: Connection reset by peer
>>>>>
>>>>>  blade61:
>>>>>  [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>>>>> 0x76b5a0: Broken pipe
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 canceled 0
>>>>> operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 error cleanup
>>>>> finished: Broken pipe
>>>>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>>>>> 0x778e00: Broken pipe
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 canceled 0
>>>>> operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 error cleanup
>>>>> finished: Broken pipe
>>>>>
>>>>>  [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>>>>> 0x76b5a0: Broken pipe
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 canceled 0
>>>>> operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 error cleanup
>>>>> finished: Broken pipe
>>>>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>>>>> 0x778e00: Broken pipe
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 canceled 0
>>>>> operations, will clean up.
>>>>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 error cleanup
>>>>> finished: Broken pipe
>>>>>
>>>>> Vincenzo
>>>>>
>>>>> On 17 June 2011 13:29, Michael Moore <[email protected]> wrote:
>>>>>
>>>>>> Hi Vincenzo,
>>>>>>
>>>>>> This sounds similar to an issue just reported by Benjamin Seevers here
>>>>>> on the developers list:
>>>>>>
>>>>>> http://www.beowulf-underground.org/pipermail/pvfs2-developers/2011-June/004732.html
>>>>>>
>>>>>> Based on his experience with the issue, if you switch to alt-aio instead
>>>>>> of directio, the corruption no longer occurs. Could you try switching from
>>>>>> directio to alt-aio in your configuration to help isolate whether this is
>>>>>> a similar or a different issue? If that doesn't resolve the issue, could
>>>>>> you provide what errors, if any, you see on the client when it fails and
>>>>>> what errors, if any, appear in the pvfs2-server logs?
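
Concretely, assuming the same <StorageHints> block that appears in the
configuration quoted further down in this thread, the suggested change is just
the TroveMethod line:

 <StorageHints>
  TroveSyncMeta no
  TroveSyncData no
  TroveMethod alt-aio
 </StorageHints>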
>>>>>>
>>>>>> Thanks,
>>>>>> Michael
>>>>>>
>>>>>>  On Fri, Jun 17, 2011 at 6:48 AM, Vincenzo Gulisano <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>>  Hi,
>>>>>>> I'm using the following setup:
>>>>>>> 4 machines used as I/O servers
>>>>>>> 10 machines used as I/O clients
>>>>>>>
>>>>>>>  The configuration file is the following:
>>>>>>>
>>>>>>>  <Defaults>
>>>>>>>  UnexpectedRequests 50
>>>>>>>  EventLogging none
>>>>>>>  EnableTracing no
>>>>>>>  LogStamp datetime
>>>>>>>  BMIModules bmi_tcp
>>>>>>>  FlowModules flowproto_multiqueue
>>>>>>>  PerfUpdateInterval 1000
>>>>>>>  ServerJobBMITimeoutSecs 30
>>>>>>>  ServerJobFlowTimeoutSecs 30
>>>>>>>  ClientJobBMITimeoutSecs 300
>>>>>>>  ClientJobFlowTimeoutSecs 300
>>>>>>>  ClientRetryLimit 5
>>>>>>>  ClientRetryDelayMilliSecs 2000
>>>>>>>  PrecreateBatchSize 512
>>>>>>>  PrecreateLowThreshold 256
>>>>>>>  TCPBufferSend 524288
>>>>>>>  TCPBufferReceive 524288
>>>>>>>  StorageSpace /local/vincenzo/pvfs2-storage-space
>>>>>>>  LogFile /tmp/pvfs2-server.log
>>>>>>> </Defaults>
>>>>>>>
>>>>>>>  <Aliases>
>>>>>>>  Alias blade58 tcp://blade58:3334
>>>>>>>  Alias blade59 tcp://blade59:3334
>>>>>>>  Alias blade60 tcp://blade60:3334
>>>>>>>  Alias blade61 tcp://blade61:3334
>>>>>>> </Aliases>
>>>>>>>
>>>>>>>  <Filesystem>
>>>>>>>  Name pvfs2-fs
>>>>>>>  ID 1615492168
>>>>>>>  RootHandle 1048576
>>>>>>>  FileStuffing yes
>>>>>>>  <MetaHandleRanges>
>>>>>>>  Range blade58 3-1152921504606846977
>>>>>>>  Range blade59 1152921504606846978-2305843009213693952
>>>>>>>  Range blade60 2305843009213693953-3458764513820540927
>>>>>>>  Range blade61 3458764513820540928-4611686018427387902
>>>>>>>  </MetaHandleRanges>
>>>>>>>  <DataHandleRanges>
>>>>>>>  Range blade58 4611686018427387903-5764607523034234877
>>>>>>>  Range blade59 5764607523034234878-6917529027641081852
>>>>>>>  Range blade60 6917529027641081853-8070450532247928827
>>>>>>>  Range blade61 8070450532247928828-9223372036854775802
>>>>>>>  </DataHandleRanges>
>>>>>>>  <StorageHints>
>>>>>>>  TroveSyncMeta no
>>>>>>>  TroveSyncData no
>>>>>>>  TroveMethod directio
>>>>>>>  </StorageHints>
>>>>>>> </Filesystem>
>>>>>>>
>>>>>>>  I'm testing the system by continuously writing 500K chunks from one
>>>>>>> client machine. After a few seconds, the client is no longer able to
>>>>>>> write. Checking the file system manually, I can see my file (running
>>>>>>> ls), and it seems to be corrupted (no information about the file is
>>>>>>> given and I cannot remove the file). The only solution is to stop all
>>>>>>> clients/servers and re-create the file system.
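
For illustration, a rough sketch of the workload shape described above: one
client continuously submitting 500K writes through raw io_submit() (as
mentioned earlier in the thread). The actual test program is not shown here,
so the file path, open flags, and single-slot queue depth are placeholders.

/* Sketch of a continuous 500K-chunk aio write loop from one client. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

#define CHUNK (500 * 1024)          /* 500K chunks, as described above */

int main(void)
{
    aio_context_t ctx = 0;
    struct iocb cb, *cbp = &cb;
    struct io_event ev;
    char *buf = malloc(CHUNK);
    long long off = 0;
    /* placeholder path on the PVFS2 mount */
    int fd = open("/mnt/pvfs2/stream.dat", O_WRONLY | O_CREAT, 0644);

    if (!buf || fd < 0 || syscall(SYS_io_setup, 1, &ctx) < 0) {
        perror("setup");
        return 1;
    }
    memset(buf, 'x', CHUNK);

    for (;;) {                      /* "writing continuously" */
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes     = fd;
        cb.aio_lio_opcode = IOCB_CMD_PWRITE;
        cb.aio_buf        = (unsigned long)buf;
        cb.aio_nbytes     = CHUNK;
        cb.aio_offset     = off;
        if (syscall(SYS_io_submit, ctx, 1, &cbp) != 1) {
            perror("io_submit");
            return 1;
        }
        if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) != 1) {
            perror("io_getevents");
            return 1;
        }
        off += CHUNK;
    }
}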
>>>>>>>
>>>>>>>  Thanks in advance
>>>>>>>
>>>>>>>  Vincenzo
>>>>>>>
>>>>>>>  _______________________________________________
>>>>>>> Pvfs2-users mailing list
>>>>>>> [email protected]
>>>>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Pvfs2-users mailing list
>>> [email protected]
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>
>>>
>>
>>
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
