What version of OrangeFS/PVFS are you running, and on what distro/kernel? To
re-create it, is it just a stream of simple write() calls from a single
client, or something more involved?
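For reference, this is roughly the kind of reproduction loop I would try on my
end. It is only a minimal sketch assuming plain POSIX write() calls of 500K
chunks against a file on the pvfs2 mount; the path, payload, and chunk count
are placeholders, not anything from your setup:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write `count` chunks of `chunk` bytes each to `path`. Returns the total
   number of bytes written, or -1 on setup failure. Stops at the first
   failing write(), which is where the reported error should show up. */
ssize_t write_chunks(const char *path, size_t chunk, int count)
{
    char *buf = malloc(chunk);
    if (!buf)
        return -1;
    memset(buf, 'x', chunk);  /* dummy payload */

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        free(buf);
        return -1;
    }

    ssize_t total = 0;
    for (int i = 0; i < count; i++) {
        ssize_t n = write(fd, buf, chunk);
        if (n < 0) {
            perror("write");  /* first failure after a few seconds? */
            break;
        }
        total += n;
    }
    close(fd);
    free(buf);
    return total;
}
```

Called with a large count against the mount point, e.g.
write_chunks("/mnt/pvfs2/testfile", 500 * 1024, 100000), and watching errno at
the first failing write() would tell us whether the failure mode matches.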

Thanks,
Michael

On Fri, Jun 17, 2011 at 7:43 AM, Vincenzo Gulisano <
[email protected]> wrote:

> Thanks Michael
>
> I've tried setting alt-aio as TroveMethod and the problem is still there.
>
> Some logs:
>
> Client (blade39) says:
>
> [E 13:36:19.590763] server: tcp://blade60:3334
> [E 13:36:19.591006] io_process_context_recv (op_status): No such file or
> directory
> [E 13:36:19.591018] server: tcp://blade61:3334
> [E 13:36:19.768105] io_process_context_recv (op_status): No such file or
> directory
>
> Servers:
>
> blade58:
> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
> 0x7f5cac004370: Connection reset by peer
> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370 canceled 0
> operations, will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac004370 error cleanup
> finished: Connection reset by peer
> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
> 0x7f5cac0ee8f0: Connection reset by peer
> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0 canceled 0
> operations, will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x7f5cac0ee8f0 error cleanup
> finished: Connection reset by peer
>
> blade59:
> [E 06/17 13:37] trove_write_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
> 0x799410: Broken pipe
> [E 06/17 13:37] handle_io_error: flow proto 0x799410 canceled 0 operations,
> will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x799410 error cleanup
> finished: Broken pipe
>
> blade60:
> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
> 0x7fb0a012bed0: Connection reset by peer
> [E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0 canceled 0
> operations, will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x7fb0a012bed0 error cleanup
> finished: Connection reset by peer
>
> blade61:
> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
> 0x76b5a0: Broken pipe
> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 canceled 0 operations,
> will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x76b5a0 error cleanup
> finished: Broken pipe
> [E 06/17 13:37] bmi_recv_callback_fn: I/O error occurred
> [E 06/17 13:37] handle_io_error: flow proto error cleanup started on
> 0x778e00: Broken pipe
> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 canceled 0 operations,
> will clean up.
> [E 06/17 13:37] handle_io_error: flow proto 0x778e00 error cleanup
> finished: Broken pipe
>
> Vincenzo
>
> On 17 June 2011 13:29, Michael Moore <[email protected]> wrote:
>
>> Hi Vincenzo,
>>
>> This sounds similar to an issue just reported by Benjamin Seevers here on
>> the developers list:
>>
>> http://www.beowulf-underground.org/pipermail/pvfs2-developers/2011-June/004732.html
>>
>> Based on his experience, the corruption no longer occurs if you switch from
>> directio to alt-aio. Could you try that change in your configuration to
>> help isolate whether this is the same issue or a different one? If it
>> doesn't resolve the problem, could you report what errors, if any, you see
>> on the client when it fails, and what errors, if any, appear in the
>> pvfs2-server logs?
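>>
>> For reference, the only change would be the TroveMethod line in the
>> StorageHints section of your fs config (a sketch; keep your other
>> StorageHints settings as they are):
>>
>> <StorageHints>
>>     TroveSyncMeta no
>>     TroveSyncData no
>>     TroveMethod alt-aio
>> </StorageHints>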
>>
>> Thanks,
>> Michael
>>
>> On Fri, Jun 17, 2011 at 6:48 AM, Vincenzo Gulisano <
>> [email protected]> wrote:
>>
>>> Hi,
>>> I'm using the following setup:
>>> 4 machines used as I/O servers
>>> 10 machines used as I/O clients
>>>
>>> The configuration file is the following:
>>>
>>> <Defaults>
>>>     UnexpectedRequests 50
>>>     EventLogging none
>>>     EnableTracing no
>>>     LogStamp datetime
>>>     BMIModules bmi_tcp
>>>     FlowModules flowproto_multiqueue
>>>     PerfUpdateInterval 1000
>>>     ServerJobBMITimeoutSecs 30
>>>     ServerJobFlowTimeoutSecs 30
>>>     ClientJobBMITimeoutSecs 300
>>>     ClientJobFlowTimeoutSecs 300
>>>     ClientRetryLimit 5
>>>     ClientRetryDelayMilliSecs 2000
>>>     PrecreateBatchSize 512
>>>     PrecreateLowThreshold 256
>>>     TCPBufferSend 524288
>>>     TCPBufferReceive 524288
>>>     StorageSpace /local/vincenzo/pvfs2-storage-space
>>>     LogFile /tmp/pvfs2-server.log
>>> </Defaults>
>>>
>>> <Aliases>
>>>     Alias blade58 tcp://blade58:3334
>>>     Alias blade59 tcp://blade59:3334
>>>     Alias blade60 tcp://blade60:3334
>>>     Alias blade61 tcp://blade61:3334
>>> </Aliases>
>>>
>>> <Filesystem>
>>>     Name pvfs2-fs
>>>     ID 1615492168
>>>     RootHandle 1048576
>>>     FileStuffing yes
>>>     <MetaHandleRanges>
>>>         Range blade58 3-1152921504606846977
>>>         Range blade59 1152921504606846978-2305843009213693952
>>>         Range blade60 2305843009213693953-3458764513820540927
>>>         Range blade61 3458764513820540928-4611686018427387902
>>>     </MetaHandleRanges>
>>>     <DataHandleRanges>
>>>         Range blade58 4611686018427387903-5764607523034234877
>>>         Range blade59 5764607523034234878-6917529027641081852
>>>         Range blade60 6917529027641081853-8070450532247928827
>>>         Range blade61 8070450532247928828-9223372036854775802
>>>     </DataHandleRanges>
>>>     <StorageHints>
>>>         TroveSyncMeta no
>>>         TroveSyncData no
>>>         TroveMethod directio
>>>     </StorageHints>
>>> </Filesystem>
>>>
>>> I'm testing the system by writing continuously from one client machine in
>>> chunks of 500K. After a few seconds, the client can no longer write.
>>> Checking the file system manually, I can see my file (with ls), but it
>>> appears to be corrupted: no information about the file is returned and I
>>> cannot remove it. The only way to recover is to stop all clients and
>>> servers and re-create the file system.
>>>
>>> Thanks in advance
>>>
>>> Vincenzo
>>>
>>> _______________________________________________
>>> Pvfs2-users mailing list
>>> [email protected]
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>
>>>
>>
>