Thanks for the extra information, Tony. That's too bad that the
metadata sync (TroveSyncMeta) option wasn't helping for your configuration.
I don't think any relevant configuration defaults have changed since 2.7.1.
I think it is just the size update issue that I mentioned earlier, since
the slowdown only really shows up in the initial write phase. We simply have
a performance regression there for serial applications.
I don't have an answer for you yet, but we are looking into it.
-Phil
Tony Kew wrote:
Dear Phil,
I ran the iozone job manually four times. Only once was there any output
in any of the server logs (after the filesystem had been built, the
servers started, and the filesystem mounted).
The one iozone run that gave errors failed after the writes had completed,
while running the initial read pass:
################################################################################
Errors from node c14n29 PVFSv2 server logfile:
################################################################################
[E 03/23 12:38] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 3647671.
[E 03/23 12:38] fp_multiqueue_cancel: flow proto cancel called on 0x2a9586f540
[E 03/23 12:38] fp_multiqueue_cancel: I/O error occurred
[E 03/23 12:38] handle_io_error: flow proto error cleanup started on 0x2a9586f540: Operation cancelled (possibly due to timeout)
[E 03/23 12:38] PVFS2 server: signal 11, faulty address is (nil), from (nil)
[E 03/23 12:38] [bt] [(nil)]
Other than this (which I consider an anomaly for now...), the average
performance numbers for the three iozone runs that completed follow:
Initial write: 37,625.48 KB/sec
Rewrite: 149,830.93 KB/sec
Read: 170,758.41 KB/sec
Re-read: 206,256.47 KB/sec
I would say there is a definite performance issue with initial writes.
Are there perhaps any filesystem configuration defaults that may have
changed?...
Thanks Much,
Tony
Tony Kew
SAN Administrator
The Center for Computational Research
New York State Center of Excellence
in Bioinformatics & Life Sciences
701 Ellicott Street, Buffalo, NY 14203
CoE Office: (716) 881-8930 Fax: (716) 849-6656
CSE Office: (716) 645-3797 x2174
Cell: (716) 560-0910
"I love deadlines, I love the whooshing noise they make as they go by."
Douglas Adams
Tony Kew wrote:
Dear Phil,
The filesystem configuration in my tests is built as follows:
pvfs2-genconfig --quiet --protocol tcp --tcpport --notrovesync --trove-method "alt-aio" \
--server-job-timeout 60 --fsid=_a_job_specific_id_ --fsname=_a_job_specific_name_ \
--storage _a_job_specific_storage_space_ --logfile _a_job_specific_logfile_ \
--ioservers [list of nodes in the job] --metaservers [list of nodes in the job]
I believe the "--notrovesync" option already sets "TroveSyncMeta no"
in the config
file.
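A quick way to double-check is to grep the generated config; the path
below is just a placeholder for wherever the PBS job writes its fs.conf:
  # placeholder path -- substitute the config file the job actually generates
  grep TroveSyncMeta /path/to/job/fs.conf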
I'm running an interactive PBS job to make sure the "msgpair failed"
error messages are generated during the filesystem build, and not
subsequently - that certainly appears to be the case, but I'm running an
iozone job manually to be sure...
Tony
Phil Carns wrote:
Hi Tony,
This is most likely due to a change in how PVFS 2.8.x tracks file
size during writes beyond EOF. It now stores the file size explicitly in
Berkeley DB for each data file. This is required for the new
directio storage method, but we applied it to the other methods as
well to simplify compatibility.
A test that you could run to confirm this would be to change your
PVFS server configuration file to set this in the StorageHints section:
TroveSyncMeta no
With that set to "no", PVFS will still synchronize metadata
(including the explicit size field), but it may delay synchronization
until after an acknowledgement has been sent to the client. This
will probably hide the size update cost for a serial application.
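For reference, a minimal sketch of what that looks like in the generated
config (the surrounding section tags are written from memory of the
genconfig output, so match them against your file):
  <StorageHints>
      # sketch only -- delays the metadata sync (including the size field)
      # until after the client has been acknowledged
      TroveSyncMeta no
  </StorageHints>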
The size update overhead will only show up for serialized
applications that issue small writes beyond EOF (like iozone in the
"initial write" phase). If it were a parallel application, PVFS
would coalesce the size updates to reduce overhead. If it were a
serial application that used larger writes, the size update cost
would be amortized over a longer period of time.
Regarding your log file warnings, those are normal. In 2.8.x the
servers communicate with each other on startup to precreate datafile
objects. A server issues those warnings occasionally if one or more of
its peers is not yet up and running when it tries to do that, but the
warnings stop as soon as all servers are available.
thanks,
-Phil
Tony Kew wrote:
Dear Phil,
Irrespective of the --enable-mmap-racache option, there does seem to be
a marked performance drop between PVFS version 2.7.1 (with the 20 or
so patches along the way - not that any were for performance insofar as
I am aware) and version 2.8.x
version 2.8.1 was built with the following configure options:
./configure --prefix=/usr \
--libdir=/usr/lib64 \
--enable-perf-counters \
--enable-fast \
--with-kernel=%{_kernelsrcdir} \
--enable-shared
version 2.7.1 (fully patched) was configured as above, with the addition
of the --enable-mmap-racache option.
I ran three iozone tests for each of the tested distributions, using
a PBS batch job that creates a (new) filesystem across all the nodes
in the job. The batch job then runs a parallel iozone test, with one
data stream on each node. The test directory is configured as
a stripe across all the nodes, so each node is writing to all the other
nodes during the test.
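For illustration, the iozone invocation looks roughly like the following
(the flags and sizes shown are placeholders, not the exact command from
the batch script):
  # illustrative only -- -+m names a machine file listing one stream per node,
  # -t is the number of parallel streams, -i 0/-i 1 select write/rewrite and
  # read/re-read tests
  iozone -+m ./iozone_machines -t 8 -i 0 -i 1 -r 64k -s 1g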
The average I/O numbers from the three iozone runs, writing to a
directory configured using the "Simple Stripe" distribution:
v2.7.1:
Initial write: 219,306.19 KB/sec
Rewrite: 130,799.13 KB/sec
Read: 183,249.66 KB/sec
Re-read: 191,565.02 KB/sec
v2.8.1:
Initial write: 40,381.42 KB/sec
Rewrite: 132,908.15 KB/sec
Read: 203,758.06 KB/sec
Re-read: 276,100.11 KB/sec
For a TwoD Stripe distribution:
v2.7.1:
Initial write: 343,876.68 KB/sec
Rewrite: 229,740.04 KB/sec
Read: 167,045.91 KB/sec
Re-read: 166,417.03 KB/sec
v2.8.1:
Initial write: 140,253.67 KB/sec
Rewrite: 201,923.75 KB/sec
Read: 182,109.70 KB/sec
Re-read: 205,073.70 KB/sec
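For reference, the "Simple Stripe" and "TwoD Stripe" cases above refer to
the distribution chosen for the test directory; a rough sketch of how a
default distribution is selected in the server config follows (section and
parameter names are from memory of the fs.conf format, so treat this as an
example rather than the exact stanza used):
  <Distribution>
      # example values only -- check names/params against your PVFS version
      Name simple_stripe
      Param strip_size
      Value 65536
  </Distribution>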
In the server log files for the v2.8.1 runs there are many of these:
[E 03/06 16:26] Warning: msgpair failed to tcp://c14n24:3334, will retry: Connection refused
...but only at the time when the filesystem is created, so I don't
believe they have any bearing on the test results.
Let me know if I can provide any more info, or if further tests
would be of use....
Many Thanks,
Tony
[...]
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users