We use Adaptec 51245s and 51645s with:

1. max_hw_sectors_kb=512
2. RAID5 4+1 or RAID6 4+2
3. RAID chunk size = 128 KB

So each 1 MB Lustre RPC results in two 4-way striped writes with no read-modify-write penalty. We can further improve write performance by matching max_pages_per_rpc (set per OST on the client side), i.e. the maximum RPC size, to the max_hw_sectors_kb setting for the block devices. In this case,

max_pages_per_rpc=128

instead of the default of 256, at which point you get one full RAID-stripe write per RPC.
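The arithmetic behind these numbers can be sketched in a few lines of shell (the values are the ones from this setup; a 4 KB page size is assumed):

```shell
#!/bin/sh
# Values from the setup described above.
chunk_kb=128      # RAID chunk size
data_disks=4      # both RAID5 4+1 and RAID6 4+2 have 4 data disks
rpc_kb=1024       # 1 MB Lustre RPC
page_kb=4         # page size assumed to be 4 KB (x86_64)

stripe_kb=$((chunk_kb * data_disks))       # full-stripe width
writes_per_rpc=$((rpc_kb / stripe_kb))     # full-stripe writes per RPC
pages=$((stripe_kb / page_kb))             # matching max_pages_per_rpc

echo "stripe=${stripe_kb}K writes_per_rpc=${writes_per_rpc} max_pages_per_rpc=${pages}"
# → stripe=512K writes_per_rpc=2 max_pages_per_rpc=128
```

Dropping max_pages_per_rpc to 128 shrinks each RPC to 512 KB, i.e. exactly one full stripe, which also matches the max_hw_sectors_kb=512 setting.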

If you put your OSTs atop LVs (LVM2), as we do, you will want to take the additional step of making sure your LVs are aligned as well:

pvcreate --dataalignment 1024S /dev/sd$driveChar

You need a fairly recent version of LVM2 that supports the --dataalignment option. We are using lvm2-2.02.56-8.el5_5.6.x86_64.
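For reference, the 1024-sector alignment works out to exactly one full RAID stripe; a small sketch (the pvs invocation in the comment is one way to verify, assuming your LVM2 version exposes the pe_start field):

```shell
#!/bin/sh
# 1024 sectors x 512 bytes = 512 KB, i.e. one full 4 x 128 KB stripe.
sector_bytes=512
align_sectors=1024
stripe_kb=$((4 * 128))

align_kb=$((align_sectors * sector_bytes / 1024))
echo "dataalignment=${align_kb}K stripe=${stripe_kb}K"

# To check an existing PV (run on the server; pe_start should come back
# as a multiple of 1024 sectors):
#   pvs --units s -o pv_name,pe_start
```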

Note that we attempted to increase max_hw_sectors_kb for the block devices (RAID LDs) to 1024, but doing so required setting the Adaptec driver (aacraid) kernel parameter acbsize=8192, which we found to be unstable. For our Adaptec drivers we use:

options aacraid cache=7 msi=2 expose_physicals=-1 acbsize=4096
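To make that persistent across reboots, the line typically lives in the modprobe configuration (the exact path is distro-dependent; on EL5 it is commonly /etc/modprobe.conf):

```shell
# /etc/modprobe.conf on EL5, or a file such as /etc/modprobe.d/aacraid.conf
# on newer distributions (path assumed; adjust for your distro)
options aacraid cache=7 msi=2 expose_physicals=-1 acbsize=4096
```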

Note that most of the information above was the result of testing and tuning performed here by Craig Prescott.

We now have close to a PB of such storage in production here at the UF HPC Center. We used Areca cards at first but found them to be a bit too flaky for our needs. The Adaptecs seem to have some infant-mortality issues; we RMA about 10% to 12% of newly purchased cards, but if they make it past initial burn-in testing, they tend to be pretty reliable.

Regards,

Charlie Taylor
UF HPC Center

On Jul 5, 2011, at 12:33 PM, Daire Byrne wrote:

Hi,

I have been testing some LSI 9260 RAID cards for use with Lustre v1.8.6 but have found that the "megaraid_sas" driver is not really able to facilitate the 1 MB full-stripe IOs that Lustre likes. This topic has also come up recently in the following two email threads:

http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/65a1fdc312b0eccb#
http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/fcf39d85b7e945ab

I was able to raise max_hw_sectors_kb to 1024 by setting the "max_sectors" megaraid_sas module option, but found that the IOs were still pretty fragmented:

disk I/O size          ios   % cum % |  ios   % cum %
4K:                   3060   0   0   | 2611   0   0
8K:                   3261   0   0   | 2664   0   0
16K:                  6408   0   1   | 5296   0   1
32K:                 13025   1   2   | 10692   1   2
64K:                 48397   4   6   | 26417   2   4
128K:                50166   4  10   | 42218   4   9
256K:               113124   9  20   | 86516   8  17
512K:               677242  57  78   | 448231  45  63
1M:                 254195  21 100   | 355804  36 100
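(Histograms like this come from the OST brw_stats, e.g. `lctl get_param obdfilter.*.brw_stats` on the OSS.) As a rough sketch of how fragmented this is, summing the first ios column above shows only about a fifth of the IOs reached the full 1M size:

```shell
#!/bin/sh
# Counts copied from the first ios column of the histogram above.
total=$((3060+3261+6408+13025+48397+50166+113124+677242+254195))
full_1m=254195
pct=$((100 * full_1m / total))
echo "1M ios: ${pct}% of ${total}"
# → 1M ios: 21% of 1168878
```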

So next I looked at sg_tablesize and found it was being set to 80 by the driver (which queries the firmware). I tried hacking the driver to increase this value, but bad things happened, so it looks like a genuine hardware limit on these cards.

The overall throughput isn't exactly terrible because the RAID write-back cache does a reasonable job, but I suspect it could be better, e.g.

ost 3 sz 201326592K rsz 1024K obj 192 thr 192 write 1100.52 [ 231.75, 529.96] read 940.26 [ 275.70, 357.60]
ost 3 sz 201326592K rsz 1024K obj 192 thr 384 write 1112.19 [ 184.80, 546.43] read 1169.20 [ 337.63, 462.52]
ost 3 sz 201326592K rsz 1024K obj 192 thr 768 write 1217.79 [ 219.77, 665.32] read 1532.47 [ 403.58, 552.43]
ost 3 sz 201326592K rsz 1024K obj 384 thr 384 write 920.87 [ 171.82, 466.77] read 901.03 [ 257.73, 372.87]
ost 3 sz 201326592K rsz 1024K obj 384 thr 768 write 1058.11 [ 166.83, 681.25] read 1309.63 [ 346.64, 484.51]

All of this brings me to my main question - what internal cards have people here used which work well with Lustre? 3ware, Areca or other models of LSI?

Cheers,

Daire
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
