Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
Well, I did not plan to test all the possible versions of the kernel;
for sure improvements are on their way, which just confirms the
assumption that this 'technology' is not mature yet.

With IPoIB an NFS server can easily export (for instance) up to
1.2 GB/s (at least this is what I can measure), with the data in the
page cache. No problem up to that point at least. I clearly

True ... but less interesting for the actual read/write case, when the data has to go to or come from spinning disk.

understand the theoretical benefits of RDMA, and it is a clear
improvement over TCP for MPI. For MPI, however, the dramatic change is
even more on the latency side, though the peak message bandwidth is
also improved, as one might expect for NFS.

Again, true, though NFS has to walk through transport protocol layers as well as NFS application layers. This additional effort reduces performance considerably.

Add to this that you need (sadly) a copy of a buffer between the network stack and the disk stack. RDMA eliminates one of these copies, but as far as I know, it doesn't talk directly to the disks (you can do something like this with SCST in the block modes if you don't mind iSCSI).

Registration/deregistration issues are also well known to the MPI
developers, and all this is certainly not that easy to manage in
other areas.

Still, NFS-RDMA remains NFS. If the bottleneck is not in the
transport, nothing will be improved by RDMA from the performance
point of view. Even worse, what I saw with the 2.6.27 kernel +
OFED1.4-rc3 is the inability of NFS-RDMA to match the performance of
NFS-TCP for some patterns of IOzone, with a filesystem able to

Hmmm.... Most of the (default) IOzone measurements we have done (and seen published) are bound almost entirely by system ram cache. Indeed, we have had to go into the code and alter some of the constants to allow us to test greater than 16 MB records, and greater than 16 GB files. Otherwise all we measure is cache speed.
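
For what it's worth, an IOzone invocation along these lines avoids measuring only the page cache (the file size, record size, and target path are illustrative; -I requests O_DIRECT, which not every file system of that era supported):

```shell
# Sequential write (test 0) and read (test 1) with a 32 GB file and
# 1 MB records.  -I uses O_DIRECT to bypass the client page cache;
# -e and -c include flush and close times in the measurement.
iozone -i 0 -i 1 -s 32g -r 1m -I -e -c -f /mnt/nfs-rdma/iozone.tmp
```

Without -I (or a file size well beyond RAM), the numbers reported are largely cache speed, as noted above.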

Could you elaborate on system parameters, and what measurements weren't up to par, as well as what options you used?

We see NFS over RDMA on SDR achieving about 400 MB/s, while NFS over IPoIB on the same (identical, actually) hardware is about 200 MB/s on reads. With DDR IB, we ran a test between a pair of our JackRabbit machines and found a sustained ~500-550 MB/s read, and about 400 MB/s or so write. The underlying file system could handle well over 1 GB/s.

NFS over IPoIB wasn't close.

sustain several hundred MB/s on its own (using exactly the same
hardware and software in both cases). We are far from a pure IB
bandwidth issue here; we are probably just facing an issue in how the
requests are handled, perhaps when paging occurs, I can't tell. I

I don't think this is the limitation. I think it is more along the lines of copying buffers between different stacks ... kernel buffer to user space program and then back to kernel for net->ram->disk and vice-versa.

There are other issues as well which could be causing performance degradation, specifically on payload size.

FWIW:  This is a 2.6.27.5 kernel.

could not find any tuning to solve the more obvious problem, i.e. the
low bandwidth for reading, except mounting with '-o rsize=4096';
probably not what people expect, as this will have other effects.
Anyway, this improves only the sequential read bandwidth. But of
course I will repeat my tests with the latest release of everything
when I have time, still making sure I compare apples to apples... Again, I'm sure improvements are on their way!
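
As a point of reference, client-side mounts along these lines are what is being compared (the server name, export path, and mount points are placeholders; 20049 is the port conventionally used for NFS/RDMA):

```shell
# Baseline: NFS over IPoIB (TCP), with default rsize/wsize.
mount -t nfs -o tcp server-ib:/export /mnt/nfs-ipoib

# NFS/RDMA mount; 'rdma' selects the xprtrdma transport.
mount -t nfs -o rdma,port=20049 server-ib:/export /mnt/nfs-rdma

# The workaround described above: force small reads on the RDMA mount.
mount -t nfs -o rdma,port=20049,rsize=4096 server-ib:/export /mnt/nfs-rdma
```

Shrinking rsize to 4096 also affects metadata-heavy and random workloads, which is why it is "probably not what people expect" as a general tuning.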

Fred.


-----Original Message-----
From: Talpey, Thomas [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 11 November, 2008 17:02
To: Ciesielski, Frederic (EMEA HPC&OSLO CC)
Cc: Jeff Becker; [email protected]
Subject: RE: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?

At 11:27 AM 11/10/2008, Ciesielski, Frederic (EMEA HPC&OSLO CC)
wrote:
That's great, thanks.

I ran some tests with the 2.6.27 kernel as server and client, and basically it works fine.

I could not yet find any situation where NFS-RDMA would outperform NFS/IPoIB, at least when you compare apples to apples (same
clients, same server, same protocol, and not just write to/read
from the caches), and it even seems to have severe performance
issues for reading files larger than the memory size of the
client and the server. Hopefully this will improve once more users
are able to give valuable feedback...

I have a couple of questions, and perhaps suggestions as well. First
the questions...

- Have you tried with a 2.6.28-rc4 client and server at all? There
are a number of significant NFS/RDMA improvements queued in
kernel.org, especially around RDMA memory registration as well as
RDMA operation scheduling. We've seen some significant throughput
improvement even for basic tunings.

- What type of storage are you using at the server, and have you
attempted to tune the server at all? For example, if you are storage (spindle) limited, no network tuning is likely to help and you should
address that first. Also, there are tunings such as nfsd thread
count, export options, and adapter choice that can make a large
difference.
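
A hedged sketch of the server-side knobs mentioned here (the thread count and export options are illustrative values, not recommendations):

```shell
# Raise the nfsd thread count from the common default of 8;
# the right value depends on client count and storage parallelism.
echo 64 > /proc/fs/nfsd/threads

# Example /etc/exports entry; 'async' trades crash safety for
# write throughput, so use it with care:
#   /export  192.168.0.0/24(rw,async,no_subtree_check)
exportfs -ra   # re-read /etc/exports after editing
```

If the spindles themselves cannot sustain the target rate, none of these settings will help, which is why the storage should be characterized first.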

Bottom line, you should be able to reach multi-hundred-MB/sec of
read/write throughput with NFS/RDMA, but there may be issues on
specific systems, or perhaps with the OFED1.4 code, that need to be
accounted for. If possible, you may want to set expectations based on
mainline, then try to duplicate them in the OFED backport. The
current OFED NFS/RDMA support is still evolving, while we consider
the mainline kernel.org version to be rather solid.

Tom.

Fred.

-----Original Message-----
From: Jeff Becker [mailto:[EMAIL PROTECTED]
Sent: Saturday, 08 November, 2008 22:35
To: Ciesielski, Frederic (EMEA HPC&OSLO CC)
Cc: [email protected]
Subject: Re: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?

Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
Is there any chance that the new NFS-RDMA features coming with
OFED 1.4 work with standard and current distributions, like
RHEL5, SLES10 ?
Not yet, but I'm working on it. I intend for NFSRDMA to work on
2.6.27 and 2.6.26 for OFED 1.4. The RHEL5 and SLES10 backports will
likely be done for OFED 1.4.1. Thanks.
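
For anyone trying this on a self-built 2.6.27, the rough enabling sequence (per the kernel's nfs-rdma documentation of that era; module names and the portlist step may differ with OFED's backported modules) looks like this:

```shell
# Server side: load the RDMA transport and tell nfsd to listen on it.
modprobe svcrdma
service nfs start
echo rdma 20049 > /proc/fs/nfsd/portlist

# Client side: load the client transport, then mount with -o rdma.
modprobe xprtrdma
mount -t nfs -o rdma,port=20049 server-ib:/export /mnt/nfs-rdma
```

On RHEL5/SLES10 stock kernels none of this applies yet, which is the point of the backport question.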

-jeff

Did anybody test this, or would anyone claim it is supposed to work?

I mean without building a 2.6.27 or equivalent kernel on top of
it, thus keeping almost full support from the vendors.

Enhanced kernel modules may not be sufficient to work around the limitations of old kernels...




_______________________________________________ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
