Hi!

> >> I'm using a LSI MegaRAID SAS 9280 RAID controller that is exposing a
> >> single block device.
> >> The RAID itself is a RAID6 configuration, using default settings.
> >> MegaCLI says that the virtual drive has a "Strip Size" of 64KB.
> > Hmmm... too bad, then reads are "hidden" by the controller and Linux cannot
> > see them.
> >
> 
> I'm not too happy about this either.
> My intention in the start was to get the RAID controller to just
> expose the disks,
> and let Linux handle the RAID side of things.
> However, I was unsuccessful in convincing the RAID controller to do so.
Too bad... I'd prefer a Linux software RAID too...
BTW, there are HW RAID management tools available for Linux. You probably
want to check out http://hwraid.le-vert.net/wiki.
 
> > Could you try to create the file system with "-E 
> > stride=16,stripe-width=16*(N-2)"
> > where N is the number of disks in the array. There are plenty of sites out
> > there about finding good parameters for mkfs and RAID (like
> > http://www.altechnative.net/?p=96 or
> > http://h1svr.anu.edu.au/wiki/H1DataStorage/MD1220Setup for example).
> >
> 
> The RAID setup is 5 disks, so I guess that means 3 for data and 2 for parity.
correct.

> I created the filesystem as you suggested, the resulting output from mkfs was:
> root@storage01:~# mkfs.ext4 -E stride=16,stripe-width=48 /dev/aoepool0/aoetest
[SNIP]
> I then mounted the newly created filesystem on the server, and gave it
> a run with bonnie.
> Bonnie reported sequential writing rate of ~225 MB/s, down from ~370 MB/s with
> the default settings.
> 
> When I exported it using AoE, the throughput on the client was ~60
> MB/s, down from ~70 MB/s.
The values you used are correct for 3 data disks with a 64K chunk size.
This issue is probably related to a misalignment of LVM. LVM adds a header
which has a default size of 192K -- that would perfectly match your
RAID: 3*64K = 192K...
But the default "physical extent" size does not match your RAID: 4MB cannot
be divided evenly by 192K: (4*1024)/192 = 21.333. Each 4MB extent therefore
starts another 64K further into a stripe than the previous one (4096K mod
192K = 64K), so your LVM extents aren't properly aligned -- and I doubt you
can align them, because the physical extent size needs to be a power of two
greater than 1K, while matching the RAID would require it to be divisible
by 192K... The only way out might be to change the number of disks in the
array to 4 or 6. :-(
Could you try, just once, using the raw device with the stride and
stripe-width values above (without LVM in between)?
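Something along these lines would also show where LVM actually starts
placing data on the physical volume; /dev/sdX is just a placeholder for the
block device the MegaRAID exposes, and running mkfs on it is of course
destructive to whatever is on there:

  # Where does the LVM data area start on the PV? Ideally a multiple of 192K.
  # (Newer LVM2 versions also accept pvcreate --dataalignment to change it.)
  pvs -o +pe_start /dev/sdX

  # Same ext4 geometry as before, but directly on the raw device, no LVM:
  mkfs.ext4 -E stride=16,stripe-width=48 /dev/sdX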

> Thanks, I appreciate the help from you and all the others
> who have been very helpful here on aoetools-discuss.
You're welcome! And thank you very much for always reporting back the
results.

> What I'm not quite understanding is how exporting a device via AoE
> would introduce new alignment problems or similar.
> When I can write to the local filesystem at ~370 MB/s, what kind of
> problem is introduced by using AoE or another network storage solution?
> 
> I did a quick iSCSI setup using iscsitarget and open-iscsi, but I saw the
> exact same ~70 MB/s throughput there, so I guess this isn't related to
> AoE in itself.
There are two root causes for these issues:
* SAN protocols force a "commit" of unwritten data -- be it a "sync", direct
  I/O or whatever -- way more often than local disks, for the sake of data
  integrity. (Actually, write barriers should be enabled for all those AoE
  devices, especially with newer kernels.)
* AoE (and iSCSI too) uses a block size of 4K (a size that perfectly fits
  into a jumbo frame), so all I/O is aligned around this size. When using a
  filesystem like ext4 or XFS you can influence the block sizes by creating
  the file system properly (see the mkfs example below).
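For what it's worth, a sketch of what I mean (the device path is the one
from your earlier mkfs run, and -b 4096 is usually the ext4 default on a
device this size anyway, so this only makes it explicit):

  # 4K blocks match the AoE/jumbo-frame I/O size;
  # stride = 64K chunk / 4K block = 16, stripe-width = 16 * 3 data disks = 48
  mkfs.ext4 -b 4096 -E stride=16,stripe-width=48 /dev/aoepool0/aoetest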

And now for some ASCII art:
Let's say a simple hard disk has the following physical blocks:
+----+----+----+----+----+----+----+----+----+----+-..-+
| 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | .. |
+----+----+----+----+----+----+----+----+----+----+-..-+

Then a RAID 5 with a chunk size of 2 hard disk blocks, consisting of 4 disks,
looks like this (D1 1-2 means disk 1, blocks 1 and 2):
+----+----+----+----+----+----+----+----+----+----+-..-+
| D1 1-2  | D2 1-2  | D3 1-2  | D4 1-2  | D1 3-4  | .. |
+----+----+----+----+----+----+----+----+----+----+-..-+
\------------ DATA -----------/\-PARITY-/
\                                      / \
 ----------- RAID block 1 -------------   --------- ..

One data block of this RAID can only be written as a whole. So whenever even
a single bit within that block changes, the whole block has to be written
again (because the parity is only valid for the block as a whole).

Now imagine you have an LVM header that is half the size of a RAID block: it
will fill the first half of the block, and the first LVM volume will then
fill the rest of that block, plus some more blocks, plus half a block at the
end. Write operations are then no longer aligned and cause massive rewrites
in the backend.
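If you want to see that effect in isolation, something like the following
dd comparison against the raw device could show it (destructive, so only on
a scratch array; /dev/sdX is again a placeholder, and a controller with a
large write-back cache may partly hide the difference):

  # Full, stripe-aligned writes: 192K = 3 data disks * 64K chunk, no read-modify-write.
  dd if=/dev/zero of=/dev/sdX bs=192K count=5000 oflag=direct

  # The same amount of data in 4K pieces, shifted 64K into a stripe
  # (seek is in units of bs, so seek=16 means 64K) -- partial-stripe writes
  # force the controller into read-modify-write cycles.
  dd if=/dev/zero of=/dev/sdX bs=4K count=240000 seek=16 oflag=direct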

From my point of view there are several ways to track down the root cause of
the issues:
* try a different RAID level (like 10 or so)
* (re)try to export the disks to Linux as JBODs
* try different filesystem and LVM parameters (you'd better write a script
  for that -- a rough sketch follows below... ;-)
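Something like this, just as a starting point -- DEV and MNT are
placeholders, every pass reformats DEV, and bonnie++ is only one way to
measure:

  #!/bin/sh
  # Sweep over a few mkfs.ext4 geometries and benchmark each one.
  DEV=/dev/aoepool0/aoetest   # device to test -- will be reformatted!
  MNT=/mnt/aoetest
  for params in "stride=16,stripe-width=48" \
                "stride=16,stripe-width=32" \
                "stride=32,stripe-width=96"; do
      mkfs.ext4 -F -E "$params" "$DEV"
      mount "$DEV" "$MNT"
      echo "=== $params ==="
      bonnie++ -d "$MNT" -u root
      umount "$MNT"
  done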

And, let us know about the results!
Thanks,
        Adi
