Hi!

> >> I'm using an LSI MegaRAID SAS 9280 RAID controller that is exposing a
> >> single block device.
> >> The RAID itself is a RAID6 configuration, using default settings.
> >> MegaCLI says that the virtual drive has a "Strip Size" of 64KB.
> >
> > Hmmm... too bad, then reads are "hidden" by the controller and Linux
> > cannot see them.
>
> I'm not too happy about this either.
> My intention from the start was to get the RAID controller to just
> expose the disks and let Linux handle the RAID side of things.
> However, I was unsuccessful in convincing the RAID controller to do so.

Too bad... I'd prefer a Linux software RAID too...
Btw., there are hardware RAID management tools available for Linux; you
probably want to check out http://hwraid.le-vert.net/wiki.

> > Could you try to create the file system with
> > "-E stride=16,stripe-width=16*(N-2)",
> > where N is the number of disks in the array? There are plenty of sites
> > out there about finding good parameters for mkfs and RAID (like
> > http://www.altechnative.net/?p=96 or
> > http://h1svr.anu.edu.au/wiki/H1DataStorage/MD1220Setup for example).
>
> The RAID setup is 5 disks, so I guess that means 3 for data and 2 for
> parity.

Correct.
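
Just to spell out where those numbers come from (assuming ext4's default
block size of 4K):

    stride       = chunk size / fs block size = 64K / 4K = 16
    stripe-width = stride * data disks        = 16 * 3   = 48
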
> I created the filesystem as you suggested; the resulting output from
> mkfs was:
>
> root@storage01:~# mkfs.ext4 -E stride=16,stripe-width=48 /dev/aoepool0/aoetest
> [SNIP]
>
> I then mounted the newly created filesystem on the server and gave it
> a run with bonnie.
> Bonnie reported a sequential write rate of ~225 MB/s, down from ~370 MB/s
> with the default settings.
>
> When I exported it using AoE, the throughput on the client was ~60 MB/s,
> down from ~70 MB/s.

The values you used are correct for 3 data disks with a 64K chunk size.

Probably this issue is related to a misalignment of LVM. LVM adds a
header which has a default size of 192K -- that would perfectly match
your RAID: 3*64K = 192K... but the default "physical extent" size does
not match your RAID: 4MB cannot be divided by 192K: (4*1024)/192 = 21.333.
That means your LVM chunks aren't properly aligned -- and I doubt you can
align them, because the physical extent size needs to be a power of two
(and > 1K), while to line up with the RAID it would have to be divisible
by 192K... The only way out could be to change the number of disks in the
array to 4 or 6. :-(

Could you just once try to use the raw device with the stride and
stripe-width values used above (without LVM in between)?

> Thanks, I appreciate the help from you and all the others
> who have been very helpful here on aoetools-discuss.

You're welcome! And thank you very much for always reporting back the
results.

> What I'm not quite understanding is how exporting a device via AoE
> would introduce new alignment problems or similar.
> When I can write to the local filesystem at ~370 MB/s, what kind of
> problem is introduced by using AoE or another network storage solution?
>
> I did a quick iSCSI setup using iscsitarget and open-iscsi, but I saw
> the exact same ~70 MB/s throughput there, so I guess this isn't related
> to AoE in itself.

There are two root causes for these issues:

* SAN protocols force a "commit" of unwritten data, be it a "sync",
  direct I/O or whatever, way more often than local disks -- for the sake
  of data integrity. (Actually, write barriers should be enabled for all
  those AoE devices, especially with newer kernels.)

* AoE (and iSCSI too) uses a block size of 4K (a size that fits perfectly
  into a jumbo frame), so all I/O is aligned around this size.

When using a filesystem like ext4 or xfs, one can influence the block
sizes by creating the file system properly.

And now for some ASCII art. Let's say a simple hard disk has the
following physical blocks:

+----+----+----+----+----+----+----+----+----+----+-..-+
|  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | .. |
+----+----+----+----+----+----+----+----+----+----+-..-+

Then a RAID 5 consisting of 4 disks with a chunk size of 2 hard disk
blocks looks like this ("D1 1-2" means disk 1, blocks 1 and 2):

+----+----+----+----+----+----+----+----+----+----+-..-+
| D1 1-2  | D2 1-2  | D3 1-2  | D4 1-2  | D1 3-4  | .. |
+----+----+----+----+----+----+----+----+----+----+-..-+
\------------ DATA -----------/\-PARITY-/
 \                                      /
  ----------- RAID block 1 -------------   --------- ..

A RAID block like this can only be written as a whole. So whenever only
one bit within that block changes, the whole block has to be written
again (because the checksum is only valid for the block as a whole). Now
imagine you have an LVM header that is half the size of a RAID block: it
will fill the first half of the block, and the first LVM volume will then
fill the rest of the first block, plus some more blocks, plus a half
block at the end. Write operations are then not aligned and cause massive
rewrites in the backend.
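
To put rough numbers on that for this particular setup (back-of-the-envelope
only, assuming the 192K metadata area and 4MB extents mentioned above):

    full data stripe     = 3 * 64K = 192K
    start of extent #k   = 192K + k * 4096K
    4096K mod 192K       = 64K

    extent #0 -> starts   0K into a stripe (aligned)
    extent #1 -> starts  64K into a stripe (misaligned)
    extent #2 -> starts 128K into a stripe (misaligned)
    extent #3 -> aligned again, and so on

So at best one extent in three begins on a stripe boundary; writes going
through the other two run into the read-modify-write behaviour sketched
above.
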
From my point of view there are several ways to track down the root cause
of the issues:

* try a different RAID level (like 10 or so)
* (re-)try to export the disks to Linux as JBODs
* try different filesystem and LVM parameters (you'd actually better
  write a script for that... ;-) -- see the P.S. below for a sketch)

And, let us know about the results!

Thanks,
Adi
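
P.S. For the "script" bullet above, something along these lines is what I
had in mind -- an untested sketch only; it assumes bonnie++ is installed,
that /dev/sdX is the raw MegaRAID LUN, and that there is nothing valuable
on it:

  #!/bin/sh
  # Untested sketch: sweep a few stride/stripe-width combinations on the
  # raw device and record the bonnie++ numbers.  DESTROYS all data on $DEV!
  DEV=/dev/sdX          # the raw MegaRAID LUN -- adjust!
  MNT=/mnt/benchtest
  mkdir -p "$MNT"
  for STRIDE in 16 32; do
      for DATADISKS in 2 3 4; do
          SW=$((STRIDE * DATADISKS))
          echo "=== stride=$STRIDE stripe-width=$SW ==="
          mkfs.ext4 -F -E stride=$STRIDE,stripe-width=$SW "$DEV" >/dev/null
          mount "$DEV" "$MNT"
          bonnie++ -d "$MNT" -u root -q
          umount "$MNT"
      done
  done

Adjust the stride and data-disk lists to whatever layouts you actually
want to compare, and add pvcreate/lvcreate steps if you also want to test
with LVM in the stack.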