Hi Christopher,
I think the root cause here is that the server hit a "no such
file or directory" error while it was trying to write data. It then
cancelled its current I/O operation, which made the client time out;
the timeout likely reset the connection and produced the broken pipe
messages.
Do you know what sort of workload was occurring when this happened? Is
it possible that a file was deleted while a process was still writing to it?
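Just to illustrate what I mean: on an ordinary local filesystem a process
can keep writing to a file after its name has been unlinked, but a PVFS2
server can return ENOENT in that situation, which would line up with the
error in your log. A minimal sketch of the scenario (paths are
placeholders; on a storage node you could also look for it with
"lsof +L1", assuming lsof is installed, which lists open files whose
link count has dropped to zero):

```shell
# Reproduce the "deleted while still open" pattern locally:
tmp=$(mktemp)
exec 3>"$tmp"              # hold an open write descriptor on the file
rm "$tmp"                  # unlink the name while fd 3 is still open
echo "still writing" >&3   # on a local fs this write still succeeds...
ls "$tmp" 2>/dev/null || echo "...but the directory entry is gone"
exec 3>&-                  # close the fd; the data is discarded
```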
The DNS round-robin should work just fine for your fstab.
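If you want to double-check the round-robin itself, something like this
(pvfsnsd.ib being your round-robin name, and assuming dig is available)
should show all three storage node addresses coming back across
repeated lookups:

```shell
# Count how often each address is returned first over several lookups;
# with round-robin DNS all three storage nodes should appear.
for i in $(seq 1 9); do
    dig +short pvfsnsd.ib | head -n 1
done | sort | uniq -c
```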
thanks,
-Phil
On 11/03/2010 12:27 PM, Christopher Coffey wrote:
Hello,
I'm concerned with some errors I'm seeing in the logs on one of our
storage nodes and one client node. This is on a freshly built pvfs2 fs.
[D 11/02 11:39] PVFS2 Server version 2.8.2 starting.
[E 11/02 14:53] trove_write_callback_fn: I/O error occurred
[E 11/02 14:53] handle_io_error: flow proto error cleanup started on
0x5535420: No such file or directory
[E 11/02 14:53] handle_io_error: flow proto 0x5535420 canceled 0
operations, will clean up.
[E 11/02 14:53] handle_io_error: flow proto 0x5535420 error cleanup
finished: No such file or directory
[E 11/02 14:58] trove_read_callback_fn: I/O error occurred
[E 11/02 14:58] handle_io_error: flow proto error cleanup started on
0x55b6330: Broken pipe
[E 11/02 14:58] handle_io_error: flow proto 0x55b6330 canceled 0
operations, will clean up.
[E 11/02 14:58] handle_io_error: flow proto 0x55b6330 error cleanup
finished: Broken pipe
[E 11:43:15.488743] PVFS Client Daemon Started. Version 2.8.2
[D 11:43:15.488941] [INFO]: Mapping pointer 0x2abd7a0cc000 for I/O.
[D 11:43:15.495858] [INFO]: Mapping pointer 0x7878000 for I/O.
[E 14:58:10.085848] job_time_mgr_expire: job time out: cancelling bmi
operation, job_id: 59797618.
[E 14:58:10.085938] bmi_to_mem_callback_fn: I/O error occurred
[E 14:58:10.085949] handle_io_error: flow proto error cleanup started
on 0x883d528: Operation cancelled (possibly due to timeout)
[E 14:58:10.085956] handle_io_error: flow proto 0x883d528 canceled 0
operations, will clean up.
[E 14:58:10.085963] handle_io_error: flow proto 0x883d528 error
cleanup finished: Operation cancelled (possibly due to timeout)
[E 14:58:10.085972] io_datafile_complete_operations: flow failed,
retrying from msgpair
Configuration and other info:
> pvfs2-statfs -m /pvfs2
aggregate statistics:
---------------------------------------
fs_id: 1713165884
total number of servers (meta and I/O): 3
handles available (meta and I/O): 9223372036854771237
handles total (meta and I/O): 9223372036854775800
bytes available: 11107855478784
bytes total: 11249595187200
NOTE: The aggregate total and available statistics are calculated based
on an algorithm that assumes data will be distributed evenly; thus
the free space is equal to the smallest I/O server capacity
multiplied by the number of I/O servers. If this number seems
unusually small, then check the individual server statistics below
to look for problematic servers.
meta server statistics:
---------------------------------------
server: tcp://sn1.ib:3334
RAM bytes total : 8365256704
RAM bytes free : 48209920
uptime (seconds) : 171290
load averages : 12480 27808 24480
handles available: 3074457345618257075
handles total : 3074457345618258600
bytes available : 3702620225536
bytes total : 3749865062400
mode: serving both metadata and I/O data
server: tcp://sn2.ib:3334
RAM bytes total : 8365256704
RAM bytes free : 48979968
uptime (seconds) : 169733
load averages : 39328 33984 26464
handles available: 3074457345618257081
handles total : 3074457345618258600
bytes available : 3702620045312
bytes total : 3749865062400
mode: serving both metadata and I/O data
server: tcp://sn3.ib:3334
RAM bytes total : 8365256704
RAM bytes free : 46600192
uptime (seconds) : 171290
load averages : 24352 26560 21920
handles available: 3074457345618257081
handles total : 3074457345618258600
bytes available : 3702618492928
bytes total : 3749865062400
mode: serving both metadata and I/O data
I/O server statistics:
---------------------------------------
server: tcp://sn1.ib:3334
RAM bytes total : 8365256704
RAM bytes free : 48209920
uptime (seconds) : 171290
load averages : 12480 27808 24480
handles available: 3074457345618257075
handles total : 3074457345618258600
bytes available : 3702620225536
bytes total : 3749865062400
mode: serving both metadata and I/O data
server: tcp://sn2.ib:3334
RAM bytes total : 8365256704
RAM bytes free : 48979968
uptime (seconds) : 169733
load averages : 39328 33984 26464
handles available: 3074457345618257081
handles total : 3074457345618258600
bytes available : 3702620045312
bytes total : 3749865062400
mode: serving both metadata and I/O data
server: tcp://sn3.ib:3334
RAM bytes total : 8365256704
RAM bytes free : 46600192
uptime (seconds) : 171290
load averages : 24352 26560 21920
handles available: 3074457345618257081
handles total : 3074457345618258600
bytes available : 3702618492928
bytes total : 3749865062400
mode: serving both metadata and I/O data
Environment information:
- 3 storage nodes, each serving both I/O and metadata roles
- 4 clients
- the storage nodes mount their storage from some raid boxes on a SAN
The PVFS communication is TCP over InfiniBand. One thing I did that may
or may not be an issue was to set up DNS round-robin for the clients'
access to the storage nodes. Each client has a line like this in
its /etc/fstab:
tcp://pvfsnsd.ib:3334/pvfs2-fs /pvfs2 pvfs2 defaults,noauto,intr 0 0
So, in theory, requests should be balanced across all 3 storage
nodes.
Let me know if you need additional information, thank you.
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users