Dear Phil,
That fixed it. I successfully ran a 64-node PBS job with a PVFS
filesystem built on the fly, and ran a 64-node iozone job against that
filesystem. The test dir was set up with a 16K strip on each node, for a
1024K stripe, and iozone was set up to write in 1024K blocks. The job
took about an hour and five minutes to run:
Initial write   668,321.05 KB/sec
Rewrite         677,999.64 KB/sec
Read            300,937.30 KB/sec
Re-read         313,756.56 KB/sec
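For a rough per-node view of these figures, here is a small sketch. It assumes the numbers are aggregates summed across all 64 clients and that 1 MB = 1024 KB; neither assumption is confirmed by the iozone output itself, and the helper name per_node_mb is made up for illustration.

```python
# Aggregate iozone throughput from the run above, in KB/sec.
RESULTS_KB_PER_SEC = {
    "Initial write": 668321.05,
    "Rewrite": 677999.64,
    "Read": 300937.30,
    "Re-read": 313756.56,
}
NODES = 64      # servers and clients in the PBS job
STRIP_KB = 16   # strip size per server


def per_node_mb(kb_per_sec, nodes=NODES):
    """Average MB/sec per node, assuming 1 MB = 1024 KB (an assumption)."""
    return kb_per_sec / 1024 / nodes


# One full stripe touches every server once: 64 x 16K = 1024K.
stripe_kb = NODES * STRIP_KB
print(f"stripe width: {stripe_kb}K")

for name, rate in RESULTS_KB_PER_SEC.items():
    print(f"{name:14s} {rate / 1024:8.1f} MB/sec total, "
          f"{per_node_mb(rate):5.2f} MB/sec per node")
```

The stripe width matching the 1024K iozone block size means each block should land on all 64 servers once.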
Three of the 64 nodes showed some timeouts in their server logs from the
job, but I believe these are non-fatal, e.g.:
###############################################
hostname: c17n29.ccr.buffalo.edu
###############################################
killing the PVFS2 server for PBS job: 1093826.bono.ccr.buffalo.edu
-----------------------------------------------
Server log file contents:
-----------------------------------------------
[E 10/24 17:25] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 3679067.
[E 10/24 17:25] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 3680783.
[E 10/24 17:25] fp_multiqueue_cancel: flow proto cancel called on 0x2a957f7b80
[E 10/24 17:25] handle_io_error: flow proto error cleanup started on 0x2a957f7b80: Operation cancelled (possibly due to timeout)
[E 10/24 17:25] handle_io_error: flow proto 0x2a957f7b80 canceled 1 operations, will clean up.
[E 10/24 17:25] handle_io_error: flow proto 0x2a957f7b80 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 10/24 17:42] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 4959046.
[E 10/24 18:04] PVFS2 server got signal 15 (server_status_flag: 262143)
-----------------------------------------------
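To check the other nodes' logs for the same pattern, something like the following sketch could be used. It assumes each log entry sits on a single line (with the mail client's wrapping removed) and that the message text is exactly as in the excerpt above; the function name flow_timeouts is made up for illustration.

```python
import re
from collections import Counter

# Matches the flow-timeout entries as they appear in the excerpt above.
TIMEOUT_RE = re.compile(
    r"\[E (\d+/\d+ \d+:\d+)\] job_time_mgr_expire: job time out: "
    r"cancelling flow\s+operation, job_id: (\d+)\."
)


def flow_timeouts(log_text):
    """Return (timestamp, job_id) pairs for each flow timeout in log_text."""
    return [(ts, int(job)) for ts, job in TIMEOUT_RE.findall(log_text)]


# Sample lines copied from the c17n29 server log.
sample = """\
[E 10/24 17:25] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 3679067.
[E 10/24 17:25] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 3680783.
[E 10/24 17:42] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 4959046.
[E 10/24 18:04] PVFS2 server got signal 15 (server_status_flag: 262143)
"""

hits = flow_timeouts(sample)
per_minute = Counter(ts for ts, _ in hits)  # timeouts grouped by log timestamp
print(len(hits), dict(per_minute))
```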
I haven't yet tried setting "TroveMethod alt-aio" in <StorageHints>, or
increasing "ServerJobFlowTimeoutSecs", as you suggested to Brian.
In the latter case, the pvfs2-genconfig option --server-job-timeout 300
sets both "ServerJobFlowTimeoutSecs" and "ServerJobBMITimeoutSecs" to 300.
Presumably this is what you want?
Incidentally, what does the "alt-aio" option do?
Thanks Much,
Tony
Tony Kew
SAN Administrator
The Center for Computational Research
New York State Center of Excellence
in Bioinformatics & Life Sciences
701 Ellicott Street, Buffalo, NY 14203
CoE Office: (716) 881-8930 Fax: (716) 849-6656
CSE Office: (716) 645-3797 x2174
Cell: (716) 560-0910 Home: (716) 874-2126
"I love deadlines, I love the whooshing noise they make as they go by."
Douglas Adams
Phil Carns wrote:
Whoops, thanks for catching that. This additional patch should fix it.
thanks,
-Phil
Tony Kew wrote:
Dear Phil,
The patch works under Red Hat Enterprise Linux 5.2, but not under
RHEL 4 update 5, which doesn't have DB_BUFFER_SMALL in
/usr/include/db4/db.h
[...]
Tony Kew wrote:
Dear Phil,
The patch looks good - I can set a 64 node config now:
e.g.
ramones$ setfattr -n "user.pvfs2.dist_name" -v "varstrip_dist" test
ramones$ setfattr -n "user.pvfs2.dist_params" -v "strips:0:16K;1:16K;2:16K;3:16K;4:16K;5:16K;6:16K;7:16K;8:16K;9:16K;10:16K;11:16K;12:16K;13:16K;14:16K;15:16K;16:16K;17:16K;18:16K;19:16K;20:16K;21:16K;22:16K;23:16K;24:16K;25:16K;26:16K;27:16K;28:16K;29:16K;30:16K;31:16K;32:16K;33:16K;34:16K;35:16K;36:16K;37:16K;38:16K;39:16K;40:16K;41:16K;42:16K;43:16K;44:16K;45:16K;46:16K;47:16K;48:16K;49:16K;50:16K;51:16K;52:16K;53:16K;54:16K;55:16K;56:16K;57:16K;58:16K;59:16K;60:16K;61:16K;62:16K;63:16K" test
ramones$ getfattr -n "user.pvfs2.dist_params" test
# file: test
user.pvfs2.dist_params="strips:0:16K;1:16K;2:16K;3:16K;4:16K;5:16K;6:16K;7:16K;8:16K;9:16K;10:16K;11:16K;12:16K;13:16K;14:16K;15:16K;16:16K;17:16K;18:16K;19:16K;20:16K;21:16K;22:16K;23:16K;24:16K;25:16K;26:16K;27:16K;28:16K;29:16K;30:16K;31:16K;32:16K;33:16K;34:16K;35:16K;36:16K;37:16K;38:16K;39:16K;40:16K;41:16K;42:16K;43:16K;44:16K;45:16K;46:16K;47:16K;48:16K;49:16K;50:16K;51:16K;52:16K;53:16K;54:16K;55:16K;56:16K;57:16K;58:16K;59:16K;60:16K;61:16K;62:16K;63:16K"
ramones$
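Typing the 64-entry parameter by hand is error-prone, so here is a minimal sketch that builds the same "strips:" string and prints the two setfattr commands shown above instead of running them. The helper name varstrip_params is made up for illustration, and a uniform strip size on every server is assumed.

```python
def varstrip_params(servers, strip_size="16K"):
    """Build a varstrip_dist parameter string with one strip per server index."""
    strips = ";".join(f"{i}:{strip_size}" for i in range(servers))
    return f"strips:{strips}"


params = varstrip_params(64)

# Print the commands rather than executing them; "test" is the
# directory name used in the transcript above.
print('setfattr -n "user.pvfs2.dist_name" -v "varstrip_dist" test')
print(f'setfattr -n "user.pvfs2.dist_params" -v "{params}" test')
```

Changing the first argument gives the string for a different server count, e.g. varstrip_params(32) for a 32-node job.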
It may take a little while till I can install this on the cluster & test
PVFSv2 over 64 nodes, but at least the parameter can be set :-)
Thanks,
Tony
[...]
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users