Thanks for the extra information, Tony. That's too bad that the metadata sync option (TroveSyncMeta) wasn't helping for your configuration.

I don't think any relevant configuration defaults have changed since 2.7.1. I think it is just that size update issue I mentioned earlier, since it only really shows up in the initial write phase. We simply have a performance regression there for serial applications.

I don't have an answer for you yet, but we are looking into it.

-Phil

Tony Kew wrote:
Dear Phil,

I ran the iozone job manually four times.  Only one run produced any output
in any of the server logs (after the filesystem was initially built, the servers
started, and the filesystem mounted).

The one iozone run that gave errors failed after the writes had completed,
while it was running the initial read pass:

################################################################################
Errors from node c14n29 PVFSv2 server logfile:
################################################################################
[E 03/23 12:38] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 3647671.
[E 03/23 12:38] fp_multiqueue_cancel: flow proto cancel called on 0x2a9586f540
[E 03/23 12:38] fp_multiqueue_cancel: I/O error occurred
[E 03/23 12:38] handle_io_error: flow proto error cleanup started on 0x2a9586f540: Operation cancelled (possibly due to timeout)
[E 03/23 12:38] PVFS2 server: signal 11, faulty address is (nil), from (nil)
[E 03/23 12:38] [bt] [(nil)]


Other than this (which I consider an anomaly for now...), the average performance
numbers for the three iozone runs that completed are as follows:

     Initial write:  37,625.48 KB/sec
           Rewrite: 149,830.93 KB/sec
              Read: 170,758.41 KB/sec
           Re-read: 206,256.47 KB/sec

I would say there is a definite performance issue with initial writes.

Are there any filesystem configuration defaults that may have changed,
perhaps?...

Thanks Much,
Tony

Tony Kew
SAN Administrator
The Center for Computational Research
New York State Center of Excellence
in Bioinformatics & Life Sciences
701 Ellicott Street, Buffalo, NY 14203

CoE Office: (716) 881-8930          Fax: (716) 849-6656
CSE Office: (716) 645-3797 x2174
     Cell: (716) 560-0910

"I love deadlines, I love the whooshing noise they make as they go by."
                                                         Douglas Adams



Tony Kew wrote:
Dear Phil,

The filesystem configuration in my tests is built as follows:

pvfs2-genconfig --quiet --protocol tcp --tcpport --notrovesync --trove-method "alt-aio" \
    --server-job-timeout 60 --fsid=_a_job_specific_id --fsname=_a_job_specific_name_ \
    --storage _a_job_specific_storage_space_ --logfile _a_job_specific_logfile_ \
    --ioservers [list of nodes in the job] --metaservers [list of nodes in the job]

I believe the "--notrovesync" option already sets "TroveSyncMeta no" in the config
file.
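
A quick way to double-check is to grep the generated config for the Trove
sync settings; the path below is just a placeholder for the job-specific
config file that pvfs2-genconfig writes out:

grep -i trovesync _a_job_specific_config_file_

If --notrovesync does what I expect, TroveSyncMeta (and possibly
TroveSyncData) should come back as "no" in the StorageHints section.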

I'm running an interactive PBS job to make sure the "msgpair failed" error messages are generated during the filesystem build, and not subsequently. That certainly appears to be the case, but I'm running an iozone job manually to be sure...
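
Another simple check is to pull the warning timestamps out of the server
logs and compare them against the time the filesystem was built; the
logfile name below is the job-specific placeholder from the genconfig
command above:

grep "msgpair failed" _a_job_specific_logfile_

If all of the timestamps fall inside the startup window, the warnings
should have no bearing on the iozone results.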

Tony


Tony Kew
SAN Administrator
The Center for Computational Research
New York State Center of Excellence
in Bioinformatics & Life Sciences
701 Ellicott Street, Buffalo, NY 14203

CoE Office: (716) 881-8930          Fax: (716) 849-6656
CSE Office: (716) 645-3797 x2174
     Cell: (716) 560-0910

"I love deadlines, I love the whooshing noise they make as they go by."
                                                         Douglas Adams



Phil Carns wrote:
Hi Tony,

This is most likely due to a change in how PVFS 2.8.x tracks file size during writes beyond EOF. It now stores the file size explicitly in Berkeley DB for each datafile. This is required for the new directio storage method, but we applied it to the other methods as well to simplify compatibility.

A test that you could run to confirm this would be to set the following in the StorageHints section of your PVFS server configuration file:

TroveSyncMeta no

With that set to "no", PVFS will still synchronize metadata (including the explicit size field), but it may delay synchronization until after an acknowledgement has been sent to the client. This will probably hide the size update cost for a serial application.
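
For reference, that setting goes inside the StorageHints block of the file
system config; the surrounding lines below are just typical entries from a
generated config and may differ in your setup:

<StorageHints>
    TroveSyncMeta no
    TroveSyncData yes
</StorageHints>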

The size update overhead will only show up for serialized applications that issue small writes beyond EOF (like iozone in the "initial write" phase). If it were a parallel application, PVFS would coalesce the size updates to reduce overhead. If it were a serial application that used larger writes, the size update cost would be amortized over a longer period of time.

Regarding your log file warnings, those are normal. In 2.8.x the servers communicate with each other on startup to precreate datafile objects. A server occasionally issues those warnings if one or more of the other servers is not up and running yet when it tries to do that, but they stop as soon as all servers are available.

thanks,
-Phil


Tony Kew wrote:
Dear Phil,

Irrespective of the --enable-mmap-racache option, there does seem to be
a marked performance drop between PVFS version 2.7.1 (with the 20 or
so patches applied along the way, none of which were for performance insofar
as I am aware) and version 2.8.x.

Version 2.8.1 was built with the following configure options:
./configure --prefix=/usr \
--libdir=/usr/lib64 \
--enable-perf-counters \
--enable-fast \
--with-kernel=%{_kernelsrcdir} \
--enable-shared

Version 2.7.1 (fully patched) was configured as above, with the addition
of the --enable-mmap-racache option.

I ran three iozone tests for each of the tested distributions, using
a PBS batch job that creates a (new) filesystem across all the nodes
in the job.  The batch job then runs a parallel iozone test, with one
data stream on each node.  The test directory is configured as
a stripe across all the nodes, so each node is writing to all the other
nodes during the test.
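
Roughly speaking, each run is an iozone throughput test of this general
form; the record size, file size, stream count, and client list file below
are placeholders rather than the exact values used:

iozone -+m _client_list_file_ -t _number_of_streams_ -i 0 -i 1 -r 256k -s 4g

Here -i 0 selects the write/rewrite tests and -i 1 the read/re-read tests,
with one stream per node listed in the client file.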

The average I/O numbers from the three iozone runs,
writing to a directory configured using the "Simple Stripe" distribution:

v2.7.1:

 Initial write: 219,306.19 KB/sec
       Rewrite: 130,799.13 KB/sec
          Read: 183,249.66 KB/sec
       Re-read: 191,565.02 KB/sec


v2.8.1:

 Initial write:  40,381.42 KB/sec
       Rewrite: 132,908.15 KB/sec
          Read: 203,758.06 KB/sec
       Re-read: 276,100.11 KB/sec

For a TwoD Stripe distribution:

v2.7.1:

 Initial write: 343,876.68 KB/sec
       Rewrite: 229,740.04 KB/sec
          Read: 167,045.91 KB/sec
       Re-read: 166,417.03 KB/sec

v2.8.1:

 Initial write: 140,253.67 KB/sec
       Rewrite: 201,923.75 KB/sec
          Read: 182,109.70 KB/sec
       Re-read: 205,073.70 KB/sec


In the server log files for the v2.8.1 runs there are many of these:

[E 03/06 16:26] Warning: msgpair failed to tcp://c14n24:3334, will retry: Connection refused

...but only at the time when the filesystem is created, so I don't
believe they have any bearing on the test results.

Let me know if I can provide any more info, or if further tests
would be of use....


Many Thanks,
Tony

Tony Kew
SAN Administrator
The Center for Computational Research
New York State Center of Excellence
in Bioinformatics & Life Sciences
701 Ellicott Street, Buffalo, NY 14203

CoE Office: (716) 881-8930           Fax: (716) 849-6656
CSE Office: (716) 645-3797 x2174
     Cell: (716) 560-0910

"I love deadlines, I love the whooshing noise they make as they go by."
                                                         Douglas Adams
[...]

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
