Re: recommendations for stripe/chunk size
Dear Neil,

in message [EMAIL PROTECTED] you wrote:

> | The second improvement is to remove a memory copy that is internal to
> | the MD driver. The MD driver stages strip data ready to be written next
> | to the I/O controller in a page-size pre-allocated buffer. It is
> | possible to bypass this memory copy for sequential writes, thereby
> | saving SDRAM access cycles.
>
> I sure hope you've checked that the filesystem never (ever) changes a
> buffer while it is being written out. Otherwise the data written to disk
> might be different from the data used in the parity calculation :-)

Sure. Note that the usage scenarios of this implementation are not only
(actually not even primarily) focused on using such a setup as a normal
RAID server - instead, processors like the 440SPe will likely be used on
RAID controller cards themselves, and data may come in over iSCSI or over
one of the PCIe buses, but not from a normal file system.

> And what are the "Second memcpy" and "First memcpy" in the graph? I
> assume one is the memcpy mentioned above, but what is the other?

Avoiding the first memcpy means skipping the system block-level caching,
i.e. using the DIRECT_IO capability (the -dio option of the xdd tool that
was used for these benchmarks). The second memcpy is the optimization for
large sequential writes you quoted above.

Please keep in mind that these optimizations are probably not directly
useful for general-purpose use of a normal file system on top of the RAID
array; they have other goals: to provide benchmarks for the special case
of large synchronous I/O operations (as used by RAID controller
manufacturers to show off against their competitors), and to provide a
base for the firmware of such controllers. Nevertheless, they clearly show
where optimizations are possible, assuming you understand your usage
scenario exactly.

In real life, your optimization may require completely different
strategies - for example, on our main file server we see the following
distribution of file sizes. Out of a sample of 14.2e6 files:

  65%   are smaller than  4 kB
  80%   are smaller than  8 kB
  90%   are smaller than 16 kB
  96%   are smaller than 32 kB
  98.4% are smaller than 64 kB

You don't want - for example - huge stripe sizes in such a system.

Best regards,

Wolfgang Denk
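A minimal sketch of what the DIRECT_IO path mentioned above looks like from
an application (illustration only, not the benchmark code; the device path,
transfer size and alignment are made-up example values):

/* Sketch: issuing writes with O_DIRECT so the kernel page cache is
 * bypassed, which is what the "-dio" option of xdd requests.  O_DIRECT
 * requires the buffer, offset and transfer size to be suitably aligned
 * (typically to the logical block size); posix_memalign() provides the
 * aligned buffer.  Device path and sizes are example values.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t blk = 1024 * 1024;           /* 1 MiB per request (example) */
    void *buf;

    if (posix_memalign(&buf, 4096, blk)) {    /* 4 KiB alignment (example) */
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0xa5, blk);

    int fd = open("/dev/md0", O_WRONLY | O_DIRECT);   /* hypothetical target */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    for (int i = 0; i < 64; i++) {            /* 64 MiB sequential write */
        if (write(fd, buf, blk) != (ssize_t)blk) {
            perror("write");
            break;
        }
    }
    close(fd);
    free(buf);
    return 0;
}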
Re: recommendations for stripe/chunk size
On Thu, Feb 07, 2008 at 06:40:12AM +0100, Iustin Pop wrote:
> On Thu, Feb 07, 2008 at 01:31:16AM +0100, Keld Jørn Simonsen wrote:
> > Anyway, why does a SATA-II drive not deliver something like 300 MB/s?
>
> Wait, are you talking about a *single* drive?

Yes, I was talking about a single drive.

> In that case, it seems you are confusing the interface speed (300 MB/s)
> with the mechanical read speed (80 MB/s).

I thought the 300 MB/s was the transfer rate between the disk and the
controller's buffer memory, but you indicate that it is the speed between
the controller's buffers and main RAM.

I am, like Neil, amazed by the speeds that we get on current hardware, but
I would still like to see if we could use the hardware better.
Asynchronous I/O could be a way forward. I have written some mainframe
utilities where asynchronous I/O was the key to performance, so I thought
it could also come in handy in the Linux kernel. If about 80 MB/s is the
maximum we can get out of a current SATA-II 7200 rpm drive, then I think
there is not much to be gained from asynchronous I/O.

> If you are asking why a single drive is limited to 80 MB/s, I guess it's
> a problem of mechanics. Even with NCQ or big readahead settings,
> ~80-100 MB/s is the highest I've seen on 7200 RPM drives. And yes: the
> drive does not wait for the CPU to process the current data before it
> reads the next data; drives have a built-in read-ahead mechanism.
>
> Honestly, I have 10x as many problems with the low random I/O throughput
> than with the (high, IMHO) sequential I/O speed.

I agree that random I/O is the main factor on most server installations.
But on workstations sequential I/O is also important, as the single user
is sometimes waiting for the computer to respond. And booting can benefit
from faster sequential I/O, too.

And not to forget: I think it is fun to make my hardware run faster!

best regards
Keld
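For readers unfamiliar with it, a minimal sketch of the kind of asynchronous
I/O being discussed, using POSIX AIO (not from the thread; the file name and
sizes are example values):

/* Sketch: keep a read in flight while the CPU does other work.
 * Link with -lrt on older glibc.
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (256 * 1024)   /* 256 KiB per request (example) */

int main(void)
{
    int fd = open("/data/bigfile", O_RDONLY);   /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(CHUNK);
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = CHUNK;
    cb.aio_offset = 0;

    if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }

    /* ... useful CPU work would go here while the read is in flight ... */

    while (aio_error(&cb) == EINPROGRESS)
        ;                     /* real code would poll several requests or use signals */

    ssize_t got = aio_return(&cb);
    printf("read %zd bytes asynchronously\n", got);

    close(fd);
    free(buf);
    return 0;
}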
Re: recommendations for stripe/chunk size
Keld Jørn Simonsen wrote:
> Hi
> I am looking at revising our howto. I see a number of places where a
> chunk size of 32 kiB is recommended, and even recommendations on maybe
> using sizes of 4 kiB.

Depending on the RAID level, a write smaller than the chunk size causes
the chunk to be read, altered, and rewritten, vs. just written if the
write is a multiple of the chunk size. Many filesystems by default use a
4k block size and 4k writes. I believe this is the reasoning behind the
suggestion of small chunk sizes. Sequential vs. random access and RAID
level are important here; there is no one size that works best in all
cases.

> My own take on that is that this really hurts performance. Normal disks
> have a rotation speed of between 5400 (laptop), 7200 (IDE/SATA) and
> 10,000 (SCSI) rounds per minute, giving an average spinning time for one
> round of 6 to 12 ms, and an average latency of half this, that is 3 to
> 6 ms. Then you need to add head movement, which is something like 2 to
> 20 ms - in total an average seek time of 5 to 26 ms, averaging around
> 13-17 ms.

Having a write that is not some multiple of the chunk size would seem to
require a read-alter-wait_for_disk_rotation-write. For large sustained
sequential I/O, using multiple drives helps transfer; for small random
I/O, small chunks are good. I find little benefit to chunks over 256 or
maybe 1024k.

> in about 15 ms you can read on current SATA-II (300 MB/s) or ATA/133
> something like between 600 to 1200 kB, at actual transfer rates of
> 80 MB/s on SATA-II and 40 MB/s on ATA/133. So to get some bang for the
> buck, and transfer some data, you should have something like 256/512 kiB
> chunks. With a transfer rate of 50 MB/s and chunk sizes of 256 kiB,
> giving a time of about 20 ms per transaction, you should be able with
> random reads to transfer 12 MB/s - my actual figure is about 30 MB/s,
> which is possibly because of the elevator effect of the file system
> driver. With a size of 4 kB per chunk you should have a time of 15 ms
> per transaction, or 66 transactions per second, or a transfer rate of
> about 250 kB/s. So 256 kB vs 4 kB speeds up the transfer by a factor
> of 50.

If you actually see anything like this, your write caching and readahead
aren't doing what they should!

> I actually think the kernel should operate with block sizes like this
> and not with 4 kiB blocks. It is the readahead and the elevator
> algorithms that save us from randomly reading 4 kB at a time.

Exactly, and nothing saves you from an R-A-RW (read-alter-rewrite) cycle
if the write is a partial chunk.

> I also see that there are some memory constraints on this. Having maybe
> 1000 processes reading, as for my mirror service, 256 kiB buffers would
> be acceptable, occupying 256 MB RAM. That is reasonable, and I could
> even tolerate 512 MB of RAM used. But going to 1 MiB buffers would be
> overdoing it for my configuration.
>
> What would be the recommended chunk size for todays equipment?

I think usage is more important than hardware. My opinion only.

> Best regards
> Keld

--
Bill Davidsen [EMAIL PROTECTED]
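The arithmetic quoted above can be written out as a small model (a sketch
using the same assumed 15 ms positioning overhead and 50 MB/s media rate as
the post, not measured data):

/* Back-of-the-envelope model: each random read pays a fixed positioning
 * cost (seek + rotational latency) and then transfers one chunk at the
 * sustained media rate.  The 15 ms and 50 MB/s figures are assumptions
 * taken from the post above.
 */
#include <stdio.h>

int main(void)
{
    const double overhead_s = 0.015;          /* seek + rotation, ~15 ms     */
    const double media_rate = 50e6;           /* sustained media rate, 50 MB/s */
    const double chunks[] = { 4e3, 64e3, 256e3, 1024e3 };

    for (int i = 0; i < 4; i++) {
        double t = overhead_s + chunks[i] / media_rate;   /* time per random read */
        printf("chunk %7.0f kB: %6.1f ops/s, %6.2f MB/s\n",
               chunks[i] / 1e3, 1.0 / t, chunks[i] / t / 1e6);
    }
    return 0;
}

With these assumptions it reproduces the ~66 transactions/s and ~250 kB/s
figure for 4 kB chunks and the ~12 MB/s figure for 256 kiB chunks.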
Re: recommendations for stripe/chunk size
In message [EMAIL PROTECTED] you wrote:

> > I actually think the kernel should operate with block sizes like this
> > and not with 4 kiB blocks. It is the readahead and the elevator
> > algorithms that save us from randomly reading 4 kB at a time.
>
> Exactly, and nothing saves you from an R-A-RW (read-alter-rewrite) cycle
> if the write is a partial chunk.

Indeed, kernel page size is an important factor in such optimizations.
But you have to keep in mind that this is mostly efficient for (very)
large, strictly sequential I/O operations only - actual file system
traffic may be *very* different.

We implemented the option to select kernel page sizes of 4, 16, 64 and
256 kB for some PowerPC systems (the 440SPe, to be precise). A nice
graphic of the effect can be found here:

https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

Best regards,

Wolfgang Denk
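A trivial way to see which page size a given kernel build uses (a sketch;
the 16/64/256 kB options mentioned above are selected at kernel build time
on those PowerPC ports):

/* Sketch: the page size selected at kernel build time is what userspace
 * sees through sysconf().
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    printf("kernel page size: %ld bytes\n", page);
    return 0;
}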
Re: recommendations for stripe/chunk size
Wolfgang Denk wrote:
> In message [EMAIL PROTECTED] you wrote:
> > > I actually think the kernel should operate with block sizes like
> > > this and not with 4 kiB blocks. It is the readahead and the elevator
> > > algorithms that save us from randomly reading 4 kB at a time.
> >
> > Exactly, and nothing saves you from an R-A-RW (read-alter-rewrite)
> > cycle if the write is a partial chunk.
>
> Indeed, kernel page size is an important factor in such optimizations.
> But you have to keep in mind that this is mostly efficient for (very)
> large, strictly sequential I/O operations only - actual file system
> traffic may be *very* different.

That was actually what I meant by page size: that of the file system
rather than of memory, i.e. the block size typically used for writes. Or
multiples thereof, obviously.

> We implemented the option to select kernel page sizes of 4, 16, 64 and
> 256 kB for some PowerPC systems (the 440SPe, to be precise). A nice
> graphic of the effect can be found here:
>
> https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

I started reading that online and pulled down a copy to print - very neat
stuff. Thanks for the link.

--
Bill Davidsen [EMAIL PROTECTED]
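The distinction being drawn here, filesystem block size versus memory page
size, can be illustrated with a short sketch (the mount point queried is an
example value):

/* Sketch: the filesystem's preferred block size (what a typical small
 * write is rounded to) versus the MMU page size of the running kernel.
 */
#include <stdio.h>
#include <sys/statvfs.h>
#include <unistd.h>

int main(void)
{
    struct statvfs vfs;
    if (statvfs("/", &vfs) == 0)                       /* example mount point */
        printf("filesystem block size: %lu bytes\n",
               (unsigned long)vfs.f_bsize);
    printf("memory page size:      %ld bytes\n", sysconf(_SC_PAGESIZE));
    return 0;
}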
Re: recommendations for stripe/chunk size
On Wed, Feb 06, 2008 at 09:25:36PM +0100, Wolfgang Denk wrote:
> In message [EMAIL PROTECTED] you wrote:
> > > I actually think the kernel should operate with block sizes like
> > > this and not with 4 kiB blocks. It is the readahead and the elevator
> > > algorithms that save us from randomly reading 4 kB at a time.
> >
> > Exactly, and nothing saves you from an R-A-RW (read-alter-rewrite)
> > cycle if the write is a partial chunk.
>
> Indeed, kernel page size is an important factor in such optimizations.
> But you have to keep in mind that this is mostly efficient for (very)
> large, strictly sequential I/O operations only - actual file system
> traffic may be *very* different.
>
> We implemented the option to select kernel page sizes of 4, 16, 64 and
> 256 kB for some PowerPC systems (the 440SPe, to be precise). A nice
> graphic of the effect can be found here:
>
> https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

Yes, that is also what I would expect for sequential reads. Random writes
of small data blocks, kind of what is done in big databases, should show
another picture, as others have also described.

If you look at a single disk, would you get improved performance with
asynchronous I/O? I am a bit puzzled about my SATA-II performance:
nominally I could get 300 MB/s on SATA-II, but I only get about 80 MB/s.
Why is that?

I thought it was because of latency with synchronous reads. I.e., when a
chunk is read, you need to complete the I/O operation and then issue a new
one. In the meantime, while the CPU is doing these calculations, the disk
has spun a little, and to get the next data chunk we need to wait for the
disk to spin around until the head is positioned over the right data place
on the disk surface. Is that so? Or does the controller take care of this,
reading the rest of the not-yet-requested track into a buffer which can
then be delivered next time? Modern disks often have buffers of about 8 or
16 MB; I wonder why they don't have bigger buffers.

Anyway, why does a SATA-II drive not deliver something like 300 MB/s?

best regards
keld
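A rough back-of-the-envelope answer to the 300 MB/s question (the numbers
below are illustrative assumptions, not actual drive specifications): the
sustained rate is bounded by how much data passes under the head per
revolution, not by the SATA link rate.

/* Rough illustration with assumed numbers: ~0.7 MB per track at 7200 rpm
 * gives roughly 80 MB/s sustained, which matches the figures seen in
 * practice, regardless of the 300 MB/s interface speed.
 */
#include <stdio.h>

int main(void)
{
    const double rpm      = 7200.0;
    const double track_mb = 0.7;              /* MB per track, assumed value */
    const double rev_s    = 60.0 / rpm;       /* ~8.33 ms per revolution     */

    printf("sustained media rate ~ %.0f MB/s\n", track_mb / rev_s);
    return 0;
}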
Re: recommendations for stripe/chunk size
On Thu, Feb 07, 2008 at 01:31:16AM +0100, Keld Jørn Simonsen wrote:
> Anyway, why does a SATA-II drive not deliver something like 300 MB/s?

Wait, are you talking about a *single* drive? In that case, it seems you
are confusing the interface speed (300 MB/s) with the mechanical read
speed (80 MB/s).

If you are asking why a single drive is limited to 80 MB/s, I guess it's a
problem of mechanics. Even with NCQ or big readahead settings,
~80-100 MB/s is the highest I've seen on 7200 RPM drives. And yes: the
drive does not wait for the CPU to process the current data before it
reads the next data; drives have a built-in read-ahead mechanism.

Honestly, I have 10x as many problems with the low random I/O throughput
than with the (high, IMHO) sequential I/O speed.

regards,
iustin
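Since readahead settings come up here, a small sketch of how the per-device
readahead can be queried from a program (the device path is an example; the
same value is what `blockdev --getra` prints):

/* Sketch: BLKRAGET reports the block device readahead in 512-byte sectors. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sda", O_RDONLY);     /* example device */
    if (fd < 0) { perror("open"); return 1; }

    long ra = 0;
    if (ioctl(fd, BLKRAGET, &ra) == 0)
        printf("readahead: %ld sectors (%ld KiB)\n", ra, ra / 2);
    else
        perror("BLKRAGET");

    close(fd);
    return 0;
}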
Re: recommendations for stripe/chunk size
On Wednesday February 6, [EMAIL PROTECTED] wrote:
> We implemented the option to select kernel page sizes of 4, 16, 64 and
> 256 kB for some PowerPC systems (the 440SPe, to be precise). A nice
> graphic of the effect can be found here:
>
> https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

Thanks for the link!

| The second improvement is to remove a memory copy that is internal to
| the MD driver. The MD driver stages strip data ready to be written next
| to the I/O controller in a page-size pre-allocated buffer. It is
| possible to bypass this memory copy for sequential writes, thereby
| saving SDRAM access cycles.

I sure hope you've checked that the filesystem never (ever) changes a
buffer while it is being written out. Otherwise the data written to disk
might be different from the data used in the parity calculation :-)

And what are the "Second memcpy" and "First memcpy" in the graph? I assume
one is the memcpy mentioned above, but what is the other?

NeilBrown
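A toy illustration of the hazard described above (not MD code): if a data
buffer changes after parity was computed over it but before it reaches the
disk, a later reconstruction from parity returns stale data.

/* Toy example: parity is computed over a snapshot of the stripe; if the
 * filesystem then modifies d0 before it is written, the on-disk parity no
 * longer reconstructs what actually went to disk.
 */
#include <stdio.h>

int main(void)
{
    unsigned char d0 = 0x11, d1 = 0x22, d2 = 0x33;

    unsigned char parity = d0 ^ d1 ^ d2;     /* parity computed now ...          */
    d0 = 0x99;                               /* ... buffer changes under us ...  */
    /* d0 = 0x99 and the stale parity both reach disk; later d0's disk fails: */
    unsigned char rebuilt = parity ^ d1 ^ d2;

    printf("written d0 = 0x%02x, rebuilt d0 = 0x%02x %s\n",
           0x99, rebuilt, rebuilt == 0x99 ? "(ok)" : "(corrupt!)");
    return 0;
}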
Re: recommendations for stripe/chunk size
On Wednesday February 6, [EMAIL PROTECTED] wrote:
> Keld Jørn Simonsen wrote:
> > Hi
> > I am looking at revising our howto. I see a number of places where a
> > chunk size of 32 kiB is recommended, and even recommendations on maybe
> > using sizes of 4 kiB.
>
> Depending on the RAID level, a write smaller than the chunk size causes
> the chunk to be read, altered, and rewritten, vs. just written if the
> write is a multiple of the chunk size. Many filesystems by default use a
> 4k block size and 4k writes. I believe this is the reasoning behind the
> suggestion of small chunk sizes. Sequential vs. random access and RAID
> level are important here; there is no one size that works best in all
> cases.

Not in md/raid. RAID4/5/6 will do a read-modify-write if you are writing
less than one *page*, but then they often have to read-modify-write anyway
for parity updates. No level will ever read a whole chunk just because it
is a chunk.

To answer the original question: the only way to be sure is to test your
hardware with your workload with different chunk sizes. But I suspect that
around 256K is good on current hardware.

NeilBrown
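For reference, a sketch of the parity arithmetic behind the small-write
read-modify-write mentioned above (illustration only, not the md
implementation): a sub-stripe write reads the old data block and old
parity, then recomputes the parity with two XORs.

/* RAID-5 style small-write parity update:
 *   new_parity = old_parity ^ old_data ^ new_data
 */
#include <assert.h>
#include <stdio.h>

int main(void)
{
    unsigned char d[3]  = { 0x10, 0x20, 0x30 };      /* data blocks in a stripe */
    unsigned char p     = d[0] ^ d[1] ^ d[2];        /* current parity          */

    unsigned char newd1 = 0xab;                      /* small write to block 1  */
    p    = p ^ d[1] ^ newd1;                         /* read-modify-write update */
    d[1] = newd1;

    assert(p == (d[0] ^ d[1] ^ d[2]));               /* parity still consistent */
    printf("new parity = 0x%02x\n", p);
    return 0;
}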
Re: recommendations for stripe/chunk size
On Thursday February 7, [EMAIL PROTECTED] wrote:
> Anyway, why does a SATA-II drive not deliver something like 300 MB/s?

Are you serious? A high-end 15000 RPM enterprise-grade drive such as the
Seagate Cheetah® 15K.6 only delivers 164 MB/sec. The SATA bus might be
able to deliver 300 MB/s, but an individual drive will be around 80 MB/s
unless it is really expensive.

(or was that yesterday? I'm having trouble keeping up with the pace of
improvement :-)

NeilBrown
Re: recommendations for stripe/chunk size
On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:
> Hi
> I am looking at revising our howto. I see a number of places where a
> chunk size of 32 kiB is recommended, and even recommendations on maybe
> using sizes of 4 kiB.
>
> My own take on that is that this really hurts performance. Normal disks
> have a rotation speed of between 5400 (laptop), 7200 (IDE/SATA) and
> 10,000 (SCSI) rounds per minute, giving an average spinning time for one
> round of 6 to 12 ms, and an average latency of half this, that is 3 to
> 6 ms. Then you need to add head movement, which is something like 2 to
> 20 ms - in total an average seek time of 5 to 26 ms, averaging around
> 13-17 ms.
>
> in about 15 ms you can read on current SATA-II (300 MB/s) or ATA/133
> something like between 600 to 1200 kB, at actual transfer rates of
> 80 MB/s on SATA-II and 40 MB/s on ATA/133. So to get some bang for the
> buck, and transfer some data, you should have something like 256/512 kiB
> chunks. With a transfer rate of 50 MB/s and chunk sizes of 256 kiB,
> giving a time of about 20 ms per transaction, you should be able with
> random reads to transfer 12 MB/s - my actual figure is about 30 MB/s,
> which is possibly because of the elevator effect of the file system
> driver. With a size of 4 kB per chunk you should have a time of 15 ms
> per transaction, or 66 transactions per second, or a transfer rate of
> about 250 kB/s. So 256 kB vs 4 kB speeds up the transfer by a factor
> of 50.
>
> I actually think the kernel should operate with block sizes like this
> and not with 4 kiB blocks. It is the readahead and the elevator
> algorithms that save us from randomly reading 4 kB at a time.
>
> I also see that there are some memory constraints on this. Having maybe
> 1000 processes reading, as for my mirror service, 256 kiB buffers would
> be acceptable, occupying 256 MB RAM. That is reasonable, and I could
> even tolerate 512 MB of RAM used. But going to 1 MiB buffers would be
> overdoing it for my configuration.
>
> What would be the recommended chunk size for todays equipment?
>
> Best regards
> Keld

My benchmarks concluded that 256 KiB to 1024 KiB is optimal; too far below
or above that range results in degradation.

Justin.