Appreciate the input.
We've been using mode 6 since I expect it has the fewest
configuration pitfalls. If the single stream becomes our bottleneck
we'll mess with aggregation.
What I can't find is the bottleneck in our current setup. With 4
servers - 2 clients, 2 OSSs - I'd expect 4Gb of aggregate throughput,
where each client has a single connection to each OST. Instead we're
limited to 2Gb, where each OSS appears limited to 1Gb of I/O. The
strangeness is that iptraf on the OSSs shows traffic through the
expected connections (2 x 2) but at only 35% - 65% of bandwidth.
And a third client writing to the filesystem will briefly increase
aggregate throughput, but it quickly settles back to ~2Gb.
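For reference, the arithmetic behind those expectations, as a quick shell sketch (assuming 1 Gb/s is taken as 1000/8 = 125 MB/s; the 2 x 2 counts are just our 2 OSSs with 2 bonded NICs each):

```shell
# Expected aggregate: 2 OSSs, each with 2 bonded 1 Gb/s NICs.
AGG_GBPS=$((2 * 2 * 1))
# 1 Gb/s = 1000 Mb/s / 8 bits per byte = 125 MB/s.
AGG_MBPS=$((AGG_GBPS * 1000 / 8))
echo "expected: ${AGG_GBPS} Gb/s (~${AGG_MBPS} MB/s); observed: ~2 Gb/s"
```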
djm
On Jan 27, 2011, at 11:16 AM, Kevin Van Maren wrote:
Normally if you are having a problem with write BW, you need to futz
with the switch. If you are having problems with read BW, you need to
futz with the server's config (xmit hash policy is the usual culprit).
Are you testing multiple clients to the same server?
Are you using mode 6 because you don't have bonding support in your
switch? I normally use 802.3ad mode, assuming your switch supports
link aggregation.
I was bonding 2x1Gb links for Lustre back in 2004. That was before
BOND_XMIT_POLICY_LAYER34 was in the kernel, so I had to hack the bond
xmit hash (with multiple NICs standard, layer2 hashing does not
produce a uniform distribution, and can't work if going through a
router).
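For anyone following along, the 802.3ad-plus-layer3+4 setup Kevin describes might look like this on RHEL 5 (a sketch only; bond0 is a placeholder for your bond interface, and your switch ports must be configured for link aggregation):

```shell
# /etc/modprobe.conf (RHEL 5 convention) would carry lines like:
#   alias bond0 bonding
#   options bond0 mode=802.3ad miimon=100 xmit_hash_policy=layer3+4
#
# Or adjust an existing bond at runtime via sysfs (the bond must be
# down to change its mode):
echo 802.3ad  > /sys/class/net/bond0/bonding/mode
echo layer3+4 > /sys/class/net/bond0/bonding/xmit_hash_policy
# Verify that the mode and hash policy took effect:
cat /proc/net/bonding/bond0
```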
Any one connection (socket or node/node connection) will use only
one gigabit link. While it is possible to use two links using
round-robin, that normally only helps for client reads (the server
can't choose which link to receive data on; the switch picks that),
and has the serious downside of out-of-order packets on the TCP
stream.
[If you want clients to have better bandwidth for a single file,
change your default stripe count to 2, so it will hit two different
servers.]
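Kevin's stripe-count suggestion, spelled out as commands (a sketch; /mnt/lustre and the directory name are placeholders for your mount point):

```shell
# Give a directory a default stripe count of 2, so new files in it
# are striped across two OSTs (and therefore two OSSs):
lfs setstripe -c 2 /mnt/lustre/striped_dir
# Check the resulting layout of a file created there:
lfs getstripe /mnt/lustre/striped_dir/somefile
```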
Kevin
David Merhar wrote:
Sorry - little b all the way around.
We're limited to 1Gb per OST.
djm
On Jan 27, 2011, at 7:48 AM, Balagopal Pillai wrote:
I guess you have two gigabit NICs bonded in mode 6 and not two
1GB NICs? (B = bytes, b = bits.) The max aggregate throughput would be
about 200MBps out of the 2 bonded NICs. I think mode 0 bonding works
only with Cisco EtherChannel or something similar on the switch side.
Same with the FC connection: it's 4Gbps (not 4GBps), or about
400-500 MBps max throughput. Maybe you could also measure the max
read and write capabilities of the RAID controller, not just the
network.

When testing with dd, some of the data remains as dirty data until
it's flushed to disk. I think the default dirty_background_ratio is
10% for RHEL 5, which would be sizable if your OSSs have lots of RAM.
There is a chance of the OSS locking up once it hits the dirty_ratio
limit, which is 40% by default. So a bit more aggressive flushing to
disk by lowering dirty_background_ratio, and a bit more headroom
before it hits dirty_ratio, is generally desirable if your RAID
controller can keep up with it.

So with your current setup, I guess you could get a max of 400MBps
out of both OSSs if they both have two 1Gb NICs in them. Maybe if you
have one of the Dell switches with 4 10Gb ports (their PowerConnect
6248), 10Gb NICs for your OSSs might be a cheaper way to increase the
aggregate performance. I think over 1GBps from a client is possible
in cases where you use InfiniBand and RDMA to deliver data.
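The writeback tuning described above, as sysctls (a sketch; 5 and 20 are illustrative values, not tested recommendations for this setup):

```shell
# Check the current writeback thresholds (RHEL 5 defaults: 10% / 40%):
sysctl vm.dirty_background_ratio vm.dirty_ratio
# Start background flushing earlier, and leave more headroom before
# dirty_ratio throttles writers:
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=20
```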
David Merhar wrote:
Our OSSs with 2x1GB NICs (bonded) appear limited to 1GB worth of
write throughput each.
Our setup:
2 OSS serving 1 OST each
Lustre 1.8.5
RHEL 5.4
New Dell M610 blade servers with plenty of CPU and RAM
All SAN fibre connections are at least 4GB
Some notes:
- A direct write (dd) from a single OSS to the OST gets 4GB, the
OSS's fibre wire speed.
- A single client will get 2GB of lustre write speed, the client's
ethernet wire speed.
- We've tried bond mode 6 and 0 on all systems. With mode 6 we will
see both NICs on both OSSs receiving data.
- We've tried multiple OSTs per OSS.
But 2 clients writing a file will get 2GB of total bandwidth to the
filesystems. We have been unable to isolate any particular resource
bottleneck. None of the systems (MDS, OSS, or client) seems to be
working very hard.
The 1GB per OSS threshold is so consistent that it almost appears by
design - and hopefully we're missing something obvious.
Any advice?
Thanks.
djm
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss