Hi Eric,

Thanks for the information. 

I am aware of the recsize option and its intended use. However, when I explored it to confirm the expected behavior, I found the opposite!

The test case: build 38, Solaris 11, a 2 GB file initially created with 1 MB sequential writes (SW), a recsize of 8 KB, on a pool with two raid-z 5+1 vdevs, accessed by 24 threads issuing 8 KB random writes (RW), for 500,000 ops or 40 seconds, whichever came first. The result at the pool level was that 78% of the operations were random reads (RR), all overhead. For the same test with a 128 KB recsize (the default), the pool access was pure SW, beautiful. I ran this test 5 times. The results with an 8 KB recsize were consistent; however, ONE of the 128 KB recsize tests did show 62% RR at the pool level. This is not exactly a confidence builder for predictability.
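
For reference, the setup was along these lines (the pool, dataset and device names here are just placeholders; the important detail is that recsize has to be set before the 2 GB file is created):

    zpool create tank \
        raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
        raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
    zfs create tank/test
    zfs set recordsize=8k tank/test    # 128 KB is the default, so this step
                                       # is skipped for the second case
    # then create the 2 GB file with 1 MB sequential writes and run the
    # 24-thread 8 KB random-write workload against it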

As I understand it, the striping logic is separate from the on-disk format and can be changed in the future, so I would suggest a variant of raid-z (raid-z+) with a variable stripe width instead of a variable stripe unit. The worst case would be 1+1, but you would generally do better than mirroring in terms of the number of drives used for protection, and you could avoid dividing an 8 KB I/O over, say, 5, 10 or (god forbid) 47 drives. It would be much less overhead, something like 200 to 1 in one analysis (if I recall correctly), and hence much better performance.
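
To make the overhead arithmetic concrete (the numbers below are illustrative, not measured):

    8 KB block on a 5+1 raid-z today:
        the data is split across the 5 data drives (~1.6 KB each) plus a
        parity column, so one small write touches all 6 drives, and the COW
        read of the old block can touch them all again.
    8 KB block on the proposed raid-z+ (variable stripe width, worst case 1+1):
        8 KB of data on one drive plus 8 KB of parity on a second drive, so
        only 2 drives are touched and each I/O stays at a disk-friendly size.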

I will be happy to post ORtera summary reports for a pair of these tests if you would like to see the numbers.  However, the forum would be the better place to post the reports.

Regards,
Dave



Eric Schrock wrote:
On Wed, Aug 09, 2006 at 03:29:05PM -0700, Dave Fisk wrote:
> For example the COW may or may not have to read old data for a small
> I/O update operation, and a large portion of the pool vdev capability
> can be spent on this kind of overhead.

This is what the 'recordsize' property is for.  If you have a workload
that works on large files in very small sized chunks, setting the
recordsize before creating the files will result in a big improvement.

> Also, on read, if the pattern is random, you may or may not
> receive any benefit from the 32 KB to 128 KB reads on each disk of the
> pool vdev on behalf of a small read, say 8 KB by the application,
> again lots of overhead potential.

We're evaluating the tradeoffs on this one.  The original vdev cache has
been around forever, and hasn't really been reevaluated in the context
of the latest improvements.  See:

6437054 vdev_cache: wise up or die

The DMU-level prefetch code had to undergo a similar overhaul, and was
fixed up in build 45.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock

-- 
Dave Fisk, ORtera Inc.
Phone (562) 433-7078
[EMAIL PROTECTED]
http://www.ORtera.com

