Appreciate the input.
We've been using mode 6 since I expect it has the fewest
configuration pitfalls. If the single stream becomes our bottleneck
we'll mess with aggregation.
What I can't find is the bottleneck in our current setup. With 4
servers - 2 clients, 2 OSSs - I'd expect 4Gb of aggregate throughput,
where each client has a single connection to each OST. Instead we're
limited to 2Gb, where each OSS appears limited to 1Gb of I/O. The
strangeness is that iptraf on the OSSs shows traffic through the
expected connections (2 x 2) but at only 35% - 65% of bandwidth.
And a third client writing to the filesystem will briefly increase
aggregate throughput, but it quickly settles back to ~2Gb.
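For reference, the arithmetic behind those expectations, as a quick shell sketch (assuming 1 Gb/s is taken as 1000/8 = 125 MB/s; the 2 x 2 counts are just our 2 OSSs with 2 bonded NICs each):

```shell
# Expected aggregate: 2 OSSs, each with 2 bonded 1 Gb/s NICs.
AGG_GBPS=$((2 * 2 * 1))
# 1 Gb/s = 1000 Mb/s / 8 bits per byte = 125 MB/s.
AGG_MBPS=$((AGG_GBPS * 1000 / 8))
echo "expected: ${AGG_GBPS} Gb/s (~${AGG_MBPS} MB/s); observed: ~2 Gb/s"
```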
djm
On Jan 27, 2011, at 11:16 AM, Kevin Van Maren wrote:
Normally if you are having a problem with write BW, you need to futz
with the switch. If you are having problems with read BW, you need to
futz with the server's config (xmit hash policy is the usual culprit).
Are you testing multiple clients to the same server?
Are you using mode 6 because you don't have bonding support in your
switch? I normally use 802.3ad mode, assuming your switch supports
link aggregation.
I was bonding 2x1Gb links for Lustre back in 2004. That was before
BOND_XMIT_POLICY_LAYER34 was in the kernel, so I had to hack the bond
xmit hash (with multiple NICs standard, layer2 hashing does not
produce a uniform distribution, and can't work if going through a
router).
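For anyone following along, the 802.3ad-plus-layer3+4 setup Kevin describes might look like this on RHEL 5 (a sketch only; bond0 is a placeholder for your bond interface, and your switch ports must be configured for link aggregation):

```shell
# /etc/modprobe.conf (RHEL 5 convention) would carry lines like:
#   alias bond0 bonding
#   options bond0 mode=802.3ad miimon=100 xmit_hash_policy=layer3+4
#
# Or adjust an existing bond at runtime via sysfs (the bond must be
# down to change its mode):
echo 802.3ad  > /sys/class/net/bond0/bonding/mode
echo layer3+4 > /sys/class/net/bond0/bonding/xmit_hash_policy
# Verify that the mode and hash policy took effect:
cat /proc/net/bonding/bond0
```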
Any one connection (socket or node/node connection) will use only
one gigabit link. While it is possible to use two links using
round-robin, that normally only helps for client reads (the server
can't choose which link to receive data on; the switch picks that),
and has the serious downside of out-of-order packets on the TCP
stream.
[If you want clients to have better bandwidth for a single file,
change your default stripe count to 2, so it will hit two different
servers.]
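Kevin's stripe-count suggestion, spelled out as commands (a sketch; /mnt/lustre and the directory name are placeholders for your mount point):

```shell
# Give a directory a default stripe count of 2, so new files in it
# are striped across two OSTs (and therefore two OSSs):
lfs setstripe -c 2 /mnt/lustre/striped_dir
# Check the resulting layout of a file created there:
lfs getstripe /mnt/lustre/striped_dir/somefile
```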
Kevin
David Merhar wrote:
Sorry - little b all the way around.
We're limited to 1Gb per OST.
djm
On Jan 27, 2011, at 7:48 AM, Balagopal Pillai wrote:
I guess you have two gigabit NICs bonded in mode 6 and not two
1GB NICs? (B = bytes, b = bits.) The max aggregate throughput would be
about 200MBps out of the 2 bonded NICs. I think mode 0 bonding works
only with Cisco EtherChannel or something similar on the switch side.
Same with the FC connection: it's 4Gbps (not 4GBps), or about
400-500 MBps max throughput. Maybe you could also measure the max
read and write capabilities of the RAID controller, not just the
network.

When testing with dd, some of the data remains as dirty data until
it's flushed to disk. I think the default dirty_background_ratio is
10% for RHEL 5, which would be sizable if your OSSs have lots of RAM.
There is a chance of the OSS locking up once it hits the dirty_ratio
limit, which is 40% by default. So a bit more aggressive flushing to
disk by lowering dirty_background_ratio, and a bit more headroom
before it hits dirty_ratio, is generally desirable if your RAID
controller can keep up with it.

So with your current setup, I guess you could get a max of 400MBps
out of both OSSs if they both have two 1Gb NICs in them. Maybe if you
have one of the Dell switches with 4 10Gb ports (their PowerConnect
6248), 10Gb NICs for your OSSs might be a cheaper way to increase the
aggregate performance. I think over 1GBps from a client is possible
in cases where you use InfiniBand and RDMA to deliver data.
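The writeback tuning described above, as sysctls (a sketch; 5 and 20 are illustrative values, not tested recommendations for this setup):

```shell
# Check the current writeback thresholds (RHEL 5 defaults: 10% / 40%):
sysctl vm.dirty_background_ratio vm.dirty_ratio
# Start background flushing earlier, and leave more headroom before
# dirty_ratio throttles writers:
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=20
```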
David Merhar wrote:
Our OSSs with 2x1GB NICs (bonded) appear limited to 1GB worth of
write throughput each.
Our setup:
2 OSS serving 1 OST each
Lustre 1.8.5
RHEL 5.4
New Dell M610 blade servers with plenty of CPU and RAM
All SAN fibre connections are at least 4GB
Some notes:
- A direct write (dd) from a single OSS to the OST gets 4GB, the
OSS's fibre wire speed.
- A single client will get 2GB of lustre write speed, the client's
ethernet wire speed.
- We've tried bond mode 6 and 0 on all systems. With mode 6 we will
see both NICs on both OSSs receiving data.
- We've tried multiple OSTs per OSS.
But 2 clients writing a file will get 2GB of total bandwidth to the
filesystems. We have been unable to isolate any particular resource
bottleneck. None of the systems (MDS, OSS, or client) seems to be
working very hard.
The 1GB per OSS threshold is so consistent that it almost appears by
design - and hopefully we're missing something obvious.
Any advice?
Thanks.
djm
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss