It's an Ubuntu server, kernel 2.6.24-24-server, 64-bit.
PVFS2 is 2.8.2.

I have one client that loops calling syscall(SYS_io_submit, ...


On 17 June 2011 14:02, Michael Moore <[email protected]> wrote:

> What version of OrangeFS/PVFS and what distro/kernel version is used in the
> setup? To re-create it, just a stream of simple write() calls from a single
> client or something more involved?
>
> Thanks,
> Michael
>
>
> On Fri, Jun 17, 2011 at 7:43 AM, Vincenzo Gulisano <
> [email protected]> wrote:
>
>> Thanks, Michael.
>>
>> I've tried setting alt-aio as the TroveMethod, and the problem is still there.
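>> The relevant change (a sketch of the modified section only; the rest of
>> the configuration is unchanged):

```
<StorageHints>
TroveSyncMeta no
TroveSyncData no
TroveMethod alt-aio
</StorageHints>
```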
>>
>> Some logs:
>>
>> Client (blade39) says:
>>
>>  [E 13:36:19.590763] server: tcp://blade60:3334
>> [E 13:36:19.591006] io_process_context_recv (op_status): No such file or
>> directory
>> [E 13:36:19.591018] server: tcp://blade61:3334
>> [E 13:36:19.768105] io_process_context_recv (op_status): No such file or
>> directory
>>
>> Servers:
>>
>> blade58:
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>> 0x7f5cac004370: Connection reset by peer
>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370 canceled 0
>> operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370 error cleanup
>> finished: Connection reset by peer
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>> 0x7f5cac0ee8f0: Connection reset by peer
>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0 canceled 0
>> operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0 error cleanup
>> finished: Connection reset by peer
>>
>> blade59:
>> [E 06/17 13:37] trove_write_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>> 0x799410: Broken pipe
>> [E 06/17 13:37] handle_io_error: flow proto 0x799410 canceled 0
>> operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x799410 error cleanup
>> finished: Broken pipe
>>
>> blade60:
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>> 0x7fb0a012bed0: Connection reset by peer
>> [E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0 canceled 0
>> operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0 error cleanup
>> finished: Connection reset by peer
>>
>> blade61:
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>> 0x76b5a0: Broken pipe
>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 canceled 0
>> operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 error cleanup
>> finished: Broken pipe
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>> 0x778e00: Broken pipe
>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 canceled 0
>> operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 error cleanup
>> finished: Broken pipe
>>
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>> 0x76b5a0: Broken pipe
>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 canceled 0
>> operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 error cleanup
>> finished: Broken pipe
>> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
>> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
>> 0x778e00: Broken pipe
>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 canceled 0
>> operations, will clean up.
>> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 error cleanup
>> finished: Broken pipe
>>
>> Vincenzo
>>
>> On 17 June 2011 13:29, Michael Moore <[email protected]> wrote:
>>
>>> Hi Vincenzo,
>>>
>>> This sounds similar to an issue just reported by Benjamin Seevers here on
>>> the developers list:
>>>
>>> http://www.beowulf-underground.org/pipermail/pvfs2-developers/2011-June/004732.html
>>>
>>> Based on his experience with the issue, switching to alt-aio instead of
>>> directio makes the corruption stop. Could you try switching from directio
>>> to alt-aio in your configuration, to help isolate whether this is a
>>> similar or a different issue? If that doesn't resolve it, could you
>>> provide the errors, if any, you see on the client when it fails, and any
>>> errors that appear in the pvfs2-server logs?
>>>
>>> Thanks,
>>> Michael
>>>
>>> On Fri, Jun 17, 2011 at 6:48 AM, Vincenzo Gulisano <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>> I'm using the following setup:
>>>> 4 machines used as I/O server
>>>> 10 machines used as I/O client
>>>>
>>>> The configuration file is the following:
>>>>
>>>> <Defaults>
>>>>  UnexpectedRequests 50
>>>> EventLogging none
>>>> EnableTracing no
>>>>  LogStamp datetime
>>>> BMIModules bmi_tcp
>>>> FlowModules flowproto_multiqueue
>>>>  PerfUpdateInterval 1000
>>>> ServerJobBMITimeoutSecs 30
>>>> ServerJobFlowTimeoutSecs 30
>>>>  ClientJobBMITimeoutSecs 300
>>>> ClientJobFlowTimeoutSecs 300
>>>> ClientRetryLimit 5
>>>>  ClientRetryDelayMilliSecs 2000
>>>> PrecreateBatchSize 512
>>>> PrecreateLowThreshold 256
>>>>  TCPBufferSend 524288
>>>> TCPBufferReceive 524288
>>>> StorageSpace /local/vincenzo/pvfs2-storage-space
>>>>  LogFile /tmp/pvfs2-server.log
>>>> </Defaults>
>>>>
>>>> <Aliases>
>>>> Alias blade58 tcp://blade58:3334
>>>>  Alias blade59 tcp://blade59:3334
>>>> Alias blade60 tcp://blade60:3334
>>>> Alias blade61 tcp://blade61:3334
>>>> </Aliases>
>>>>
>>>> <Filesystem>
>>>> Name pvfs2-fs
>>>> ID 1615492168
>>>>  RootHandle 1048576
>>>> FileStuffing yes
>>>> <MetaHandleRanges>
>>>>  Range blade58 3-1152921504606846977
>>>> Range blade59 1152921504606846978-2305843009213693952
>>>>  Range blade60 2305843009213693953-3458764513820540927
>>>> Range blade61 3458764513820540928-4611686018427387902
>>>>  </MetaHandleRanges>
>>>> <DataHandleRanges>
>>>> Range blade58 4611686018427387903-5764607523034234877
>>>>  Range blade59 5764607523034234878-6917529027641081852
>>>> Range blade60 6917529027641081853-8070450532247928827
>>>>  Range blade61 8070450532247928828-9223372036854775802
>>>> </DataHandleRanges>
>>>>  <StorageHints>
>>>> TroveSyncMeta no
>>>> TroveSyncData no
>>>>  TroveMethod directio
>>>> </StorageHints>
>>>> </Filesystem>
>>>>
>>>> I'm testing the system by continuously writing 500K chunks from one
>>>> client machine. After a few seconds, the client is no longer able to
>>>> write. Checking the file system manually, I can see my file (running
>>>> ls), but it appears to be corrupted: no information about the file is
>>>> given, and I cannot remove it. The only solution is to stop all clients
>>>> and servers and re-create the file system.
>>>>
>>>> Thanks in advance
>>>>
>>>> Vincenzo
>>>>
>>>> _______________________________________________
>>>> Pvfs2-users mailing list
>>>> [email protected]
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>>
>>>>
>>>
>>
>