Hi there!
I'm curious about some of the issues relating to RAID performance under
Linux. I don't have a hardware or device-driver background, though, so I
have a couple of questions for y'all... the meat of this stuff really
isn't explained in the HOWTOs.
I've worked with computers for a good while, and with a good number of
systems at various RAID levels, but I always took them for granted...
they were someone else's problem.
Just from thinking about it, I always had the impression that RAID would
in many cases be a dramatic speed-up. Server vendors certainly give you
that impression, and from a programmer's point of view it makes sense:
reading data off five drives ought to go a lot faster than reading off
one...
Last fall I set up a big file server as a personal project. It was
surprisingly easy to get going, but in the process I learned that
linux-raid (maybe RAID in general?) serializes its access to the disks...
so with a five-drive RAID5 set, you read in raid order instead of reading
from all the disks simultaneously. OK, it's still faster, because
requests can overlap somewhat, but it doesn't seem to be taking advantage
of all the horsepower the drives have to offer.
I guess my first question is: is this true? Are RAID reads serialized?
I have an IDE-based RAID system, and visually it appears to be true. Ten
drives stacked in a case, with external access lights, show the first
drive being hit, then the second, and so on. It zips down the case fast,
but it's noticeably going one way, not all at once or randomly.
I understand there are serious caveats to IDE interfaces doing concurrent
access... they can't. I'm not in any way complaining about my specific
performance... a tenth of the throughput I get would still easily
saturate our link to the Internet.
But even if you can only access one disk per interface at a time (with,
I understand, significant overhead in switching between them), couldn't
you still access one disk on each channel simultaneously, and then switch
to the other disks? Hypothetically, with my 10-disk set I could still
read from half of them at once, since they're spread across that many
interfaces.
In fact, isn't this how Linux (or Solaris) reads from physically
independent disks: queue up reads on each one? It seems to be, because if
I run hdparm -Tt on all of my physical disks at once, I get solid access
lights on every drive, and hdparm reports results consistent with half
the drives being read simultaneously... with a noticeable CPU falloff...
but still, the combined performance beats the RAID5 set by a wide
margin.
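(To make the "hit every disk at once" test concrete: here's a rough
user-space sketch of it, not from any real tool. It stands in ordinary
temp files for the physical drives and gives each one its own reader
thread, the way running hdparm -t against every disk simultaneously
does. All the names and sizes are mine.)

```python
# Sketch: mimic reading several independent "disks" concurrently,
# one sequential reader thread per disk (temp files stand in for
# /dev/hda, /dev/hdc, ...). Illustration only, not a benchmark.
import os
import tempfile
import threading

N_DISKS = 4
DISK_SIZE = 1 << 20          # 1 MB per fake disk
CHUNK = 64 * 1024            # read in 64k requests

def make_fake_disk():
    f = tempfile.NamedTemporaryFile(delete=False)
    f.write(os.urandom(DISK_SIZE))
    f.close()
    return f.name

def reader(path, totals, idx):
    # Sequential scan of one whole "disk", like hdparm -t does.
    n = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(CHUNK)
            if not buf:
                break
            n += len(buf)
    totals[idx] = n

disks = [make_fake_disk() for _ in range(N_DISKS)]
totals = [0] * N_DISKS
threads = [threading.Thread(target=reader, args=(d, totals, i))
           for i, d in enumerate(disks)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(totals))  # every disk read in full, with the reads overlapping
for d in disks:
    os.unlink(d)
```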
SCSI also seems to suffer this fate: the benchmarks people report don't
show anything close to the improvement one might expect from disks being
read in parallel.
I was even surprised to see benchmarks from PC hardware RAID systems
showing lackluster performance compared to Linux software RAID. I mean,
if anybody should be able to build something that reads from disks in
parallel, it's the hardware vendors!
It strikes me that all these impressions are based on bonnie. Now,
bonnie, AFAIK, is single-threaded and only does one kind of disk access
at a time. That doesn't really simulate user load, where lots of folks
are accessing different parts of the disk. I understand RAID5 helps a
good bit here, at least as far as the seeks/sec figure in bonnie's output
goes. Do the drives read in parallel when they're seeking all over the
place under user load?
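(By "user load" I mean something like the following sketch, which is my
own illustration rather than anything bonnie does: many threads seeking
to random offsets in a shared file at once, instead of one thread reading
sequentially. Names and sizes are made up.)

```python
# Sketch: crude multi-user access pattern -- N_USERS threads each doing
# random-offset 4k reads against one shared file, concurrently.
import os
import random
import tempfile
import threading

FILE_SIZE = 1 << 20     # 1 MB shared file
READ_SIZE = 4096
N_USERS = 8
READS_PER_USER = 50

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(os.urandom(FILE_SIZE))
tmp.close()

results = [0] * N_USERS

def user(idx):
    rng = random.Random(idx)   # per-user seed, so runs are repeatable
    done = 0
    with open(tmp.name, "rb") as f:
        for _ in range(READS_PER_USER):
            f.seek(rng.randrange(FILE_SIZE - READ_SIZE))
            if len(f.read(READ_SIZE)) == READ_SIZE:
                done += 1
    results[idx] = done

threads = [threading.Thread(target=user, args=(i,))
           for i in range(N_USERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results))  # 8 users * 50 random reads each = 400
os.unlink(tmp.name)
```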
I also wonder how chunk size affects all this. I set up a chunk size of
128k, with 4k blocks and the matching mke2fs stride option. By
conventional logic this should be right... all of my files are about 5MB
long and are read sequentially.
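(For the record, the arithmetic behind that setup, with variable names of
my own choosing; the stride value is what gets handed to mke2fs's stride
option, spelled -E stride= in newer e2fsprogs if I remember right.)

```python
# Back-of-envelope check of the layout described above.
chunk_kb = 128                    # md chunk size
block_kb = 4                      # ext2 block size
stride = chunk_kb // block_kb     # filesystem blocks per raid chunk
print(stride)                     # 32
```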
Now, if Linux doesn't read far enough ahead, then I'm basically reading
off one disk for a while and then switching to the next. It seems that
for RAID to even have a chance of maxing out the drives on a big
sequential read, it would need to know that the OS wants a lot of data...
in this case, 128k * 10 drives = over a megabyte of read-ahead. Does
Linux request that much data at once?
If it doesn't, but does read ahead, say, 64k, wouldn't smaller chunk
sizes in theory get all the disks moving at once? With a 4k chunk size
and 64k of read-ahead, the RAID layer would at least know that it needs
data from all the drives, right now...
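(Here's the arithmetic I'm doing in my head for those two scenarios,
written out; the function and its simplifying assumptions, e.g. the
window starting on a chunk boundary and ignoring which chunk holds
parity, are mine.)

```python
# Sketch of the read-ahead arithmetic above (my numbers, not measured).
def disks_hit(readahead_kb, chunk_kb, n_disks):
    """How many distinct disks a contiguous read-ahead window touches,
    assuming the window starts on a chunk boundary (full chunks only)."""
    chunks = readahead_kb // chunk_kb
    return min(chunks, n_disks)

# 128k chunks: a 64k read-ahead doesn't even cover one full chunk,
# so only one drive has any work queued.
print(disks_hit(64, 128, 10))     # 0 full chunks spanned

# To keep all ten drives busy you'd need chunk * drives of read-ahead:
print(128 * 10)                   # 1280 KB -- the "over a megabyte" above

# 4k chunks: 64k of read-ahead spans 16 chunks, touching all 10 disks.
print(disks_hit(64, 4, 10))       # 10
```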
But aren't small chunk sizes considered poor, or at least not
recommended, for large files? Or am I just confusing chunk size with
block size?
I've also wondered whether RAID serializes because of kernel/hardware
issues. It seems plausible: as I understand it, the kernel is
single-threaded. Can it not queue up reads for multiple disks and have
them dump into memory (isn't this DMA?), or do you have to block on a
read until it returns? If you do, would this be solved by moving the RAID
layer to user space, where it could run threaded? Or would threads block
too? Am I way off?
Sorry this is so long; obviously I've been mulling it over for a while,
but I haven't been able to find technical discussions like this out
there. Any pointers to this kind of info?
Take care,
Tom