Hi Christopher,

I think the root error here is that the server got a "no such file or directory" error while trying to write data. It then cancelled its current I/O operation, which in turn made the client time out; the timeout likely reset the connection and produced the broken pipe messages.

Do you know what sort of workload was occurring when this happened? Is it possible that a file was deleted while a process was still writing to it?
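For reference, on a local POSIX filesystem an unlink by itself does not make writes through an already-open descriptor fail; it is a later open *by path* that returns ENOENT. A minimal sketch (plain Python, not PVFS-specific) showing the distinction:

```python
import os
import tempfile

# On a plain POSIX filesystem, unlinking a file does not invalidate an
# already-open file descriptor: writes through the descriptor still succeed.
fd, path = tempfile.mkstemp()
os.unlink(path)                  # remove the directory entry
n = os.write(fd, b"still ok")    # succeeds despite the unlink
os.close(fd)

# A fresh open of the same path, however, fails with ENOENT -- the
# "No such file or directory" seen in the server log above.
err = None
try:
    os.open(path, os.O_WRONLY)
except FileNotFoundError as e:
    err = e.errno

print("write after unlink wrote", n, "bytes; reopen errno =", err)
```

So if the server ever reopens a deleted file by path (rather than reusing an open descriptor), that would be one plausible way to hit this error.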

The DNS round-robin should work just fine for your fstab.

thanks,
-Phil

On 11/03/2010 12:27 PM, Christopher Coffey wrote:
Hello,

I'm concerned with some errors I'm seeing in the logs on one of our storage nodes and one client node. This is on a freshly built pvfs2 fs.

[D 11/02 11:39] PVFS2 Server version 2.8.2 starting.
[E 11/02 14:53] trove_write_callback_fn: I/O error occurred
[E 11/02 14:53] handle_io_error: flow proto error cleanup started on 0x5535420: No such file or directory
[E 11/02 14:53] handle_io_error: flow proto 0x5535420 canceled 0 operations, will clean up.
[E 11/02 14:53] handle_io_error: flow proto 0x5535420 error cleanup finished: No such file or directory
[E 11/02 14:58] trove_read_callback_fn: I/O error occurred
[E 11/02 14:58] handle_io_error: flow proto error cleanup started on 0x55b6330: Broken pipe
[E 11/02 14:58] handle_io_error: flow proto 0x55b6330 canceled 0 operations, will clean up.
[E 11/02 14:58] handle_io_error: flow proto 0x55b6330 error cleanup finished: Broken pipe


[E 11:43:15.488743] PVFS Client Daemon Started.  Version 2.8.2
[D 11:43:15.488941] [INFO]: Mapping pointer 0x2abd7a0cc000 for I/O.
[D 11:43:15.495858] [INFO]: Mapping pointer 0x7878000 for I/O.
[E 14:58:10.085848] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 59797618.
[E 14:58:10.085938] bmi_to_mem_callback_fn: I/O error occurred
[E 14:58:10.085949] handle_io_error: flow proto error cleanup started on 0x883d528: Operation cancelled (possibly due to timeout)
[E 14:58:10.085956] handle_io_error: flow proto 0x883d528 canceled 0 operations, will clean up.
[E 14:58:10.085963] handle_io_error: flow proto 0x883d528 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 14:58:10.085972] io_datafile_complete_operations: flow failed, retrying from msgpair


Configuration and other info:

> pvfs2-statfs -m /pvfs2

aggregate statistics:
---------------------------------------

        fs_id: 1713165884
        total number of servers (meta and I/O): 3
        handles available (meta and I/O):       9223372036854771237
        handles total (meta and I/O):           9223372036854775800
        bytes available:                        11107855478784
        bytes total:                            11249595187200

NOTE: The aggregate total and available statistics are calculated based
on an algorithm that assumes data will be distributed evenly; thus
the free space is equal to the smallest I/O server capacity
multiplied by the number of I/O servers.  If this number seems
unusually small, then check the individual server statistics below
to look for problematic servers.

meta server statistics:
---------------------------------------

server: tcp://sn1.ib:3334
        RAM bytes total  : 8365256704
        RAM bytes free   : 48209920
        uptime (seconds) : 171290
        load averages    : 12480 27808 24480
        handles available: 3074457345618257075
        handles total    : 3074457345618258600
        bytes available  : 3702620225536
        bytes total      : 3749865062400
        mode: serving both metadata and I/O data

server: tcp://sn2.ib:3334
        RAM bytes total  : 8365256704
        RAM bytes free   : 48979968
        uptime (seconds) : 169733
        load averages    : 39328 33984 26464
        handles available: 3074457345618257081
        handles total    : 3074457345618258600
        bytes available  : 3702620045312
        bytes total      : 3749865062400
        mode: serving both metadata and I/O data

server: tcp://sn3.ib:3334
        RAM bytes total  : 8365256704
        RAM bytes free   : 46600192
        uptime (seconds) : 171290
        load averages    : 24352 26560 21920
        handles available: 3074457345618257081
        handles total    : 3074457345618258600
        bytes available  : 3702618492928
        bytes total      : 3749865062400
        mode: serving both metadata and I/O data


I/O server statistics:
---------------------------------------

server: tcp://sn1.ib:3334
        RAM bytes total  : 8365256704
        RAM bytes free   : 48209920
        uptime (seconds) : 171290
        load averages    : 12480 27808 24480
        handles available: 3074457345618257075
        handles total    : 3074457345618258600
        bytes available  : 3702620225536
        bytes total      : 3749865062400
        mode: serving both metadata and I/O data

server: tcp://sn2.ib:3334
        RAM bytes total  : 8365256704
        RAM bytes free   : 48979968
        uptime (seconds) : 169733
        load averages    : 39328 33984 26464
        handles available: 3074457345618257081
        handles total    : 3074457345618258600
        bytes available  : 3702620045312
        bytes total      : 3749865062400
        mode: serving both metadata and I/O data

server: tcp://sn3.ib:3334
        RAM bytes total  : 8365256704
        RAM bytes free   : 46600192
        uptime (seconds) : 171290
        load averages    : 24352 26560 21920
        handles available: 3074457345618257081
        handles total    : 3074457345618258600
        bytes available  : 3702618492928
        bytes total      : 3749865062400
        mode: serving both metadata and I/O data


Environment information:

- 3 storage nodes, each serving both I/O and metadata roles
- 4 clients
- the storage nodes mount their storage from some raid boxes on a SAN

The PVFS communication is TCP over InfiniBand. One thing I did that may or may not be an issue was to set up DNS round-robin for storage node access from the clients. So each client has a line like this in its /etc/fstab:

tcp://pvfsnsd.ib:3334/pvfs2-fs /pvfs2 pvfs2 defaults,noauto,intr 0 0

So, in theory, requests should be balanced across all three storage nodes.
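One quick sanity check is to confirm the round-robin name actually resolves to all three storage nodes. A small sketch (the name `pvfsnsd.ib` and the sn*.ib hosts come from the configuration above):

```python
import socket

def resolved_addrs(name, port=3334):
    """Return the distinct IPs a hostname resolves to; with DNS
    round-robin, the balanced name should yield one address per
    storage node."""
    infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

# e.g. resolved_addrs("pvfsnsd.ib") should list the addresses of
# sn1.ib, sn2.ib, and sn3.ib
```

Note that round-robin only balances which server each client first resolves; it does not rebalance traffic afterward.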

Let me know if you need additional information, thank you.
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
