Re: recommendations for stripe/chunk size

2008-02-07 Thread Wolfgang Denk
Dear Neil,

in message [EMAIL PROTECTED] you wrote:
 
 quote
 The second improvement is to remove a memory copy that is internal to
 the MD driver. The MD driver stages strip data ready to be written
 next to the I/O controller in a page size pre-allocated buffer. It is
 possible to bypass this memory copy for sequential writes thereby
 saving SDRAM access cycles.
 /quote
 
 I sure hope you've checked that the filesystem never (ever) changes a
 buffer while it is being written out.  Otherwise the data written to
 disk might be different from the data used in the parity calculation
 :-)

Sure. Note that the usage scenarios of this implementation are not only
(actually not even primarily) focused on using such a setup as a
normal RAID server - instead, processors like the 440SPe will likely
be used on RAID controller cards themselves, and data may come from
iSCSI or over one of the PCIe buses, but not from a normal file
system.

 And what are the "Second memcpy" and "First memcpy" in the graph?
 I assume one is the memcpy mentioned above, but what is the other?

Avoiding the 1st memcpy means skipping the system block level caching,
i.e. using the DIRECT_IO capability (the -dio option of the xdd tool
which was used for these benchmarks).

The 2nd memcpy is the optimization for large sequential writes you
quoted above.
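
For illustration only - this is not the benchmark code, just a minimal
sketch of what skipping the block level cache looks like from user
space: open the file or device with O_DIRECT (which is what xdd's -dio
option enables) and read into a suitably aligned buffer. The 4096-byte
alignment and the 1 MiB request size are assumptions; adjust them for
your device:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const size_t bufsize = 1024 * 1024;     /* 1 MiB per read request */
        void *buf;
        ssize_t n;
        long long total = 0;
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
                return 1;
        }
        /* O_DIRECT requires an aligned buffer (and aligned length/offset);
         * 4096 bytes covers the usual logical sector sizes. */
        if (posix_memalign(&buf, 4096, bufsize)) {
                perror("posix_memalign");
                return 1;
        }
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open O_DIRECT");
                return 1;
        }
        while ((n = read(fd, buf, bufsize)) > 0)
                total += n;     /* data goes straight into our buffer,
                                 * bypassing the block level cache */
        printf("read %lld bytes with O_DIRECT\n", total);
        close(fd);
        free(buf);
        return 0;
}

Comparing a run of this against a normal buffered read of the same
device shows the effect of that first copy.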

Please keep in mind that these optimizations are probably not
directly useful for general-purpose use of a normal file system on
top of the RAID array; they have other goals: to provide benchmarks
for the special case of large synchronous I/O operations (as used by
RAID controller manufacturers to show off against their competitors),
and to provide a base for the firmware of such controllers.

Nevertheless, they clearly show where optimizations are possible,
assuming you understand your usage scenario exactly.

In real life, your optimization may require completely different
strategies - for example, on our main file server we see the
following distribution of file sizes:

Out of a sample of 14.2e6 files,

  65%    are smaller than  4 kB
  80%    are smaller than  8 kB
  90%    are smaller than 16 kB
  96%    are smaller than 32 kB
  98.4%  are smaller than 64 kB

You don't want - for example - huge stripe sizes in such a system.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: [EMAIL PROTECTED]
Egotist: A person of low taste, more interested in  himself  than  in
me.  - Ambrose Bierce


Re: recommendations for stripe/chunk size

2008-02-07 Thread Keld Jørn Simonsen
On Thu, Feb 07, 2008 at 06:40:12AM +0100, Iustin Pop wrote:
 On Thu, Feb 07, 2008 at 01:31:16AM +0100, Keld Jørn Simonsen wrote:
  Anyway, why does a SATA-II drive not deliver something like 300 MB/s?
 
 Wait, are you talking about a *single* drive?

Yes, I was talking about a single drive.

 In that case, it seems you are confusing the interface speed (300MB/s)
 with the mechanical read speed (80MB/s).

I thought the 300 MB/s was the transfer rate between the disk and the
controller's memory in its buffers, but you indicate that this is the
speed between the controller's buffers and main RAM.

Like Neil, I am amazed by the speeds that we get on current hardware, but
still I would like to see if we could use the hardware better.
Asynchronous I/O could be a way forward. I have written some mainframe
utilities where asynchronous I/O was the key to the performance,
so I thought that it could also come in handy in the Linux kernel.
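
Just to make concrete what I mean by asynchronous I/O, here is a
minimal sketch using POSIX AIO (aio_read from <aio.h>, link with -lrt);
the file name and the 256 kiB request size are only examples. Whether
the overlap between CPU work and disk work buys anything in practice is
exactly what I would like to find out:

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (256 * 1024)      /* one 256 kiB request, just an example */

int main(int argc, char **argv)
{
        struct aiocb cb;
        const struct aiocb *list[1];
        char *buf = malloc(CHUNK);
        int fd;

        if (argc < 2 || !buf || (fd = open(argv[1], O_RDONLY)) < 0) {
                perror("setup");
                return 1;
        }

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = CHUNK;
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0) {       /* queue the read; returns immediately */
                perror("aio_read");
                return 1;
        }

        /* ... the CPU could do useful work here while the disk seeks ... */

        list[0] = &cb;
        aio_suspend(list, 1, NULL);     /* block until the request completes */
        printf("read %zd bytes asynchronously\n", aio_return(&cb));

        close(fd);
        free(buf);
        return 0;
}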

If about 80 MB/s is the maximum we can get out of a current SATA-II
7200 rpm drive, then I think there is not much to be gained from
asynchronous I/O.

 If you are asking why is a
 single drive limited to 80 MB/s, I guess it's a problem of mechanics.
 Even with NCQ or big readahead settings, ~80-~100 MB/s is the highest
 I've seen on 7200 RPM drives. And yes, the drive does not wait for the
 CPU to process the current data before it reads the next data; drives
 have a built-in read-ahead mechanism.
 
 Honestly, I have 10x as many problems with the low random I/O throughput
 than with the (high, IMHO) sequential I/O speed.

I agree that random IO is the main factor on most server installations.
But on workstations the sequential IO is also important, as the only
user is sometimes waiting for the computer to respond.
And then I think that booting can benefit from faster sequential IO.
And not to forget, I think it is fun to make my hardware run faster!

best regards
Keld


Re: recommendations for stripe/chunk size

2008-02-06 Thread Bill Davidsen

Keld Jørn Simonsen wrote:

 Hi

 I am looking at revising our howto. I see a number of places where a
 chunk size of 32 kiB is recommended, and even recommendations on
 maybe using sizes of 4 kiB.

Depending on the raid level, a write smaller than the chunk size causes 
the chunk to be read, altered, and rewritten, vs. just written if the 
write is a multiple of chunk size. Many filesystems by default use a 4k 
page size and writes. I believe this is the reasoning behind the 
suggestion of small chunk sizes. Sequential vs. random and raid level 
are important here, there's no one size to work best in all cases.
 My own take on that is that this really hurts performance.
 Normal disks have a rotation speed of between 5400 (laptop),
 7200 (ide/sata) and 10000 (SCSI) rounds per minute, giving an average
 spinning time for one round of 6 to 12 ms, and average latency of half
 this, that is 3 to 6 ms. Then you need to add head movement which
 is something like 2 to 20 ms - in total average seek time 5 to 26 ms,
 averaging around 13-17 ms.

Having a write that is not some multiple of the chunk size would seem to
require a read-alter-wait_for_disk_rotation-write, and for large sustained
sequential I/O using multiple drives helps transfer. For small random
I/O small chunks are good; I find little benefit to chunks over 256 or
maybe 1024k.
 in about 15 ms you can read on current SATA-II (300 MB/s) or ATA/133
 something like between 600 and 1200 kB, at actual transfer rates of
 80 MB/s on SATA-II and 40 MB/s on ATA/133. So to get some bang for the
 buck, and transfer some data, you should have something like 256/512 kiB
 chunks. With a transfer rate of 50 MB/s and chunk sizes of 256 kiB,
 giving a time of about 20 ms per transaction, you should be able with
 random reads to transfer 12 MB/s - my actual figure is about 30 MB/s,
 which is possibly because of the elevator effect of the file system
 driver. With a size of 4 kB per chunk you should have a time of 15 ms
 per transaction, or 66 transactions per second, or a transfer rate of
 250 kB/s. So 256 kB vs 4 kB speeds up the transfer by a factor of 50.

If you actually see anything like this your write caching and readahead 
aren't doing what they should!



 I actually think the kernel should operate with block sizes
 like this and not with 4 kiB blocks. It is the readahead and the elevator
 algorithms that save us from randomly reading 4 kB at a time.


Exactly, and nothing save a R-A-RW cycle if the write is a partial chunk.

 I also see that there are some memory constraints on this.
 Having maybe 1000 processes reading, as for my mirror service,
 256 kiB buffers would be acceptable, occupying 256 MB RAM.
 That is reasonable, and I could even tolerate 512 MB RAM used.
 But going to 1 MiB buffers would be overdoing it for my configuration.

 What would be the recommended chunk size for today's equipment?


I think usage is more important than hardware. My opinion only.


 Best regards
 Keld



--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismark 






Re: recommendations for stripe/chunk size

2008-02-06 Thread Wolfgang Denk
In message [EMAIL PROTECTED] you wrote:

  I actually think the kernel should operate with block sizes
  like this and not with 4 kiB blocks. It is the readahead and the elevator
  algorithms that save us from randomly reading 4 kB at a time.
 

 Exactly, and nothing save a R-A-RW cycle if the write is a partial chunk.

Indeed kernel page size is an important factor in such optimizations.
But you have to keep in mind that this is mostly efficient for (very)
large strictly sequential I/O operations only - actual file system
traffic may be *very* different.

We implemented the option to select kernel page sizes of 4, 16, 64
and 256 kB for some PowerPC systems (440SPe, to be precise). A nice
graphics of the effect can be found here:

https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: [EMAIL PROTECTED]
You got to learn three things. What's  real,  what's  not  real,  and
what's the difference.   - Terry Pratchett, _Witches Abroad_


Re: recommendations for stripe/chunk size

2008-02-06 Thread Bill Davidsen

Wolfgang Denk wrote:

 In message [EMAIL PROTECTED] you wrote:

   I actually think the kernel should operate with block sizes
   like this and not with 4 kiB blocks. It is the readahead and the elevator
   algorithms that save us from randomly reading 4 kB at a time.

  Exactly, and nothing save a R-A-RW cycle if the write is a partial chunk.

 Indeed kernel page size is an important factor in such optimizations.
 But you have to keep in mind that this is mostly efficient for (very)
 large strictly sequential I/O operations only - actual file system
 traffic may be *very* different.

That was actually what I meant by page size: that of the file system
rather than the memory, i.e. the block size typically used for writes.
Or multiples thereof, obviously.
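
For anyone who wants to see the number I mean, a trivial sketch (a
throwaway example of my own, nothing more) that prints st_blksize, the
preferred I/O block size the kernel reports for a file - normally the
filesystem block size rather than the VM page size:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
        struct stat st;

        if (argc < 2 || stat(argv[1], &st) != 0) {
                perror("stat");
                return 1;
        }
        /* st_blksize: the "preferred" block size for efficient file I/O */
        printf("%s: preferred I/O block size is %ld bytes\n",
               argv[1], (long)st.st_blksize);
        return 0;
}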

 We implemented the option to select kernel page sizes of  4,  16,  64
 and  256  kB for some PowerPC systems (440SPe, to be precise). A nice
 graphics of the effect can be found here:

 https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

I started reading that online and pulled down a copy to print - very
neat stuff. Thanks for the link.

Best regards,

Wolfgang Denk

  



--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismark 





Re: recommendations for stripe/chunk size

2008-02-06 Thread Keld Jørn Simonsen
On Wed, Feb 06, 2008 at 09:25:36PM +0100, Wolfgang Denk wrote:
 In message [EMAIL PROTECTED] you wrote:
 
   I actually think the kernel should operate with block sizes
   like this and not with 4 kiB blocks. It is the readahead and the elevator
   algorithms that save us from randomly reading 4 kB at a time.
  
 
  Exactly, and nothing save a R-A-RW cycle if the write is a partial chunk.
 
 Indeed kernel page size is an important factor in such optimizations.
 But you have to keep in mind that this is mostly efficient for (very)
 large strictly sequential I/O operations only -  actual  file  system
 traffic may be *very* different.
 
 We implemented the option to select kernel page sizes of  4,  16,  64
 and  256  kB for some PowerPC systems (440SPe, to be precise). A nice
 graphics of the effect can be found here:
 
 https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

Yes, that is also what I would expect, for sequential reads.
Random writes of small data blocks, kind of what is done in big
databases, should show a different picture, as others have also described.

If you look at a single disk, would you get improved performance with
asynchronous I/O?

I am a bit puzzled about my SATA-II performance: nominally I could get
300 MB/s on SATA-II, but I only get about 80 MB/s. Why is that?
I thought it was because of latency with synchronous reads.
I.e., when a chunk is read, you need to complete the I/O operation, and
then issue a new one. In the meantime, while the CPU is doing these
calculations, the disk has spun a little, and to get the next data chunk
we need to wait for the disk to spin around until the head is positioned
over the right data place on the disk surface. Is that so? Or does the
controller take care of this, reading the rest of the not-yet-requested
track into a buffer which can then be delivered next time? Modern disks
often have buffers of about 8 or 16 MB. I wonder why they don't have
bigger buffers.

Anyway, why does a SATA-II drive not deliver something like 300 MB/s?

best regards
keld


Re: recommendations for stripe/chunk size

2008-02-06 Thread Iustin Pop
On Thu, Feb 07, 2008 at 01:31:16AM +0100, Keld Jørn Simonsen wrote:
 Anyway, why does a SATA-II drive not deliver something like 300 MB/s?

Wait, are you talking about a *single* drive?

In that case, it seems you are confusing the interface speed (300MB/s)
with the mechanical read speed (80MB/s). If you are asking why is a
single drive limited to 80 MB/s, I guess it's a problem of mechanics.
Even with NCQ or big readahead settings, ~80-~100 MB/s is the highest
I've seen on 7200 RPM drives. And yes, the drive does not wait for the
CPU to process the current data before it reads the next data; drives
have a built-in read-ahead mechanism.

Honestly, I have 10x as many problems with the low random I/O throughput
than with the (high, IMHO) sequential I/O speed.
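
If you want to see the gap on your own hardware, a rough sketch (not a
proper benchmark - the file name, request count and reliance on the
page cache are all simplifications) that does random 4 KiB preads and
reports the rate; compare it with a plain sequential read of the same
file:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        /* build: gcc -O2 randread.c -lrt  (-lrt needed on older glibc) */
        const size_t bs = 4096;
        const int count = 1000;
        char buf[4096];
        struct timespec t0, t1;
        double secs;
        off_t size, off;
        int fd, i;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
                perror("open");
                return 1;
        }
        size = lseek(fd, 0, SEEK_END);
        if (size < (off_t)bs) {
                fprintf(stderr, "file too small\n");
                return 1;
        }
        srand(1234);

        /* NOTE: reads may be served from the page cache; for raw disk
         * figures use O_DIRECT or run against a cold cache. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < count; i++) {
                off = ((off_t)rand() % (size / (off_t)bs)) * (off_t)bs;
                if (pread(fd, buf, bs, off) < 0)
                        perror("pread");
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d random %zu-byte reads in %.2f s (%.0f IOPS, %.2f MB/s)\n",
               count, bs, secs, count / secs, count * bs / secs / 1e6);
        close(fd);
        return 0;
}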

regards,
iustin


Re: recommendations for stripe/chunk size

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
 
 We implemented the option to select kernel page sizes of  4,  16,  64
 and  256  kB for some PowerPC systems (440SPe, to be precise). A nice
 graphics of the effect can be found here:
 
 https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

Thanks for the link!

quote
The second improvement is to remove a memory copy that is internal to
the MD driver. The MD driver stages strip data ready to be written
next to the I/O controller in a page size pre-allocated buffer. It is
possible to bypass this memory copy for sequential writes thereby
saving SDRAM access cycles.
/quote

I sure hope you've checked that the filesystem never (ever) changes a
buffer while it is being written out.  Otherwise the data written to
disk might be different from the data used in the parity calculation
:-)

And what are the "Second memcpy" and "First memcpy" in the graph?
I assume one is the memcpy mentioned above, but what is the other?

NeilBrown


Re: recommendations for stripe/chunk size

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
 Keld Jørn Simonsen wrote:
  Hi
 
  I am looking at revising our howto. I see a number of places where a
  chunk size of 32 kiB is recommended, and even recommendations on
  maybe using sizes of 4 kiB. 
 

 Depending on the raid level, a write smaller than the chunk size causes 
 the chunk to be read, altered, and rewritten, vs. just written if the 
 write is a multiple of chunk size. Many filesystems by default use a 4k 
 page size and writes. I believe this is the reasoning behind the 
 suggestion of small chunk sizes. Sequential vs. random and raid level 
 are important here, there's no one size to work best in all cases.

Not in md/raid.

RAID4/5/6 will do a read-modify-write if you are writing less than one
*page*, but then they often do a read-modify-write anyway for parity
updates.

No level will ever read a whole chunk just because it is a chunk.
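
To spell out what that parity update costs, here is a sketch (not the
actual md code) of the read-modify-write arithmetic for a small RAID5
write: the old data and the old parity both have to be read before the
new parity can be computed, regardless of the chunk size.

#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the RAID5 read-modify-write parity update for a write that
 * covers only part of a stripe:
 *
 *     new_parity = old_parity XOR old_data XOR new_data
 *
 * The extra reads of old_data and old_parity are the cost in question.
 */
static void rmw_parity(uint8_t *parity, const uint8_t *old_data,
                       const uint8_t *new_data, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++)
                parity[i] ^= old_data[i] ^ new_data[i];
}

A write that covers a full stripe can instead compute parity from the
new data alone (a reconstruct write), which is why large sequential
writes avoid the extra reads.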

To answer the original question:  The only way to be sure is to test
your hardware with your workload with different chunk sizes.
But I suspect that around 256K is good on current hardware.

NeilBrown


Re: recommendations for stripe/chunk size

2008-02-06 Thread Neil Brown
On Thursday February 7, [EMAIL PROTECTED] wrote:
 
 Anyway, why does a SATA-II drive not deliver something like 300 MB/s?


Are you serious?

A high end 15000 RPM enterprise grade drive such as the Seagate
Cheetah® 15K.6 only delivers 164MB/sec.

The SATA Bus might be able to deliver 300MB/s, but an individual drive
would be around 80MB/s unless it is really expensive.

(or was that yesterday?  I'm having trouble keeping up with the pace
 of improvement :-)

NeilBrown


Re: recommendations for stripe/chunk size

2008-02-05 Thread Justin Piszcz



On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:


Hi

I am looking at revising our howto. I see a number of places where a
chunk size of 32 kiB is recommended, and even recommendations on
maybe using sizes of 4 kiB.

My own take on that is that this really hurts performance.
Normal disks have a rotation speed of between 5400 (laptop),
7200 (ide/sata) and 10000 (SCSI) rounds per minute, giving an average
spinning time for one round of 6 to 12 ms, and average latency of half
this, that is 3 to 6 ms. Then you need to add head movement which
is something like 2 to 20 ms - in total average seek time 5 to 26 ms,
averaging around 13-17 ms.

in about 15 ms you can read on current SATA-II (300 MB/s) or ATA/133
something like between 600 and 1200 kB, at actual transfer rates of
80 MB/s on SATA-II and 40 MB/s on ATA/133. So to get some bang for the
buck, and transfer some data, you should have something like 256/512 kiB
chunks. With a transfer rate of 50 MB/s and chunk sizes of 256 kiB,
giving a time of about 20 ms per transaction, you should be able with
random reads to transfer 12 MB/s - my actual figure is about 30 MB/s,
which is possibly because of the elevator effect of the file system
driver. With a size of 4 kB per chunk you should have a time of 15 ms
per transaction, or 66 transactions per second, or a transfer rate of
250 kB/s. So 256 kB vs 4 kB speeds up the transfer by a factor of 50.
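
To make that arithmetic easy to reproduce, a small C program using the
round figures above (about 15 ms average access time and 50 MB/s media
rate - assumptions, adjust them for your own disks):

#include <stdio.h>

int main(void)
{
        /* assumed round figures from the discussion above */
        const double access_ms = 15.0;    /* average seek + rotational latency */
        const double media_mbs = 50.0;    /* sustained media transfer rate */
        const double chunks_kb[] = { 4, 32, 256, 1024 };
        int i;

        for (i = 0; i < 4; i++) {
                double kb       = chunks_kb[i];
                double xfer_ms  = kb / (media_mbs * 1024.0) * 1000.0;
                double total_ms = access_ms + xfer_ms;
                double mbs      = (kb / 1024.0) / (total_ms / 1000.0);

                printf("%6.0f kiB chunk: %5.1f ms per transaction -> %6.2f MB/s random read\n",
                       kb, total_ms, mbs);
        }
        return 0;
}

With these figures it prints roughly 0.26 MB/s for 4 kiB chunks and
12.5 MB/s for 256 kiB chunks, which is where the factor of about 50
comes from.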

I actually think the kernel should operate with block sizes
like this and not with 4 kiB blocks. It is the readahead and the elevator
algorithms that save us from randomly reading 4 kB at a time.

I also see that there are some memory constraints on this.
Having maybe 1000 processes reading, as for my mirror service,
256 kiB buffers would be acceptable, occupying 256 MB RAM.
That is reasonable, and I could even tolerate 512 MB RAM used.
But going to 1 MiB buffers would be overdoing it for my configuration.

What would be the recommended chunk size for today's equipment?

Best regards
Keld



My benchmarks concluded that 256 KiB to 1024 KiB is optimal; going too far
below or above that range results in degradation.


Justin.