On Sep 13, 2018, at 3:17 AM, fuzzmagnet182 via openzfs-developer
<[email protected]> wrote:

> Hi,
>
> A while ago I wrote an article investigating OpenZFS usage for
> software development. If you'd like to read it, you can
> view it here:
> http://therandomitblog.blogspot.com/2018/05/zfsdedupe-put-to-ultimate-test-against.html
>
> My question is, the throughput results for ZFS (especially IOPs) are
> quite high. Can you recommend any changes to my testing which would
> improve the accuracy of the testing?
> Or are the results of my testing to be expected?
>
> This is the FIO test file I was using during one of the tests.
>
> [global]
> ioengine=libaio
> bs=4k
> # This must be set to 0 for ZFS. 1 for all others.
> direct=0
> # This must be set to none for ZFS. posix for all others.
> fallocate=none
> rw=randrw
> # Make sure fio will refill the IO buffers on every submit rather than
> # just init
> refill_buffers
> # Setting to zero in an attempt to stop ZFS from skewing results via
> # de-dupe.
> #dedupe_percentage=0
> # Setting to zero in an attempt to stop both ZFS and Nimble from
> # skewing results via compression.
> buffer_compress_percentage=0
> norandommap
> randrepeat=0
> rwmixread=70
> runtime=60
> ramp_time=5
> group_reporting
> directory=/zfs/io_testing
> filename=fio_testfile
> time_based=1
> runtime=60
> [16t-rand-write-16q-4k]
> name=4k100writetest-16t-16q
> rw=randrw
> bs=4k
> rwmixread=0
> numjobs=16
> iodepth=16
Hello,
ZFS performance is my specialty, and fio is my weapon of choice.
When it comes to benchmarking filesystem performance, size matters.
You don't say how large the test files are, but the results you get
lead me to believe everything was fitting nicely in ARC, hence the
large difference between primarycache=all and metadata/none in the
read tests.
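
If you want to take the ARC out of the picture deliberately while you
test, you can pin the cache policy on the test dataset. A minimal
sketch, assuming the directory /zfs/io_testing is backed by a dataset
named tank/io_testing (substitute whatever your dataset is actually
called):

  # Assumed dataset name; check with: zfs list -o name,mountpoint
  zfs get primarycache tank/io_testing            # defaults to 'all'
  zfs set primarycache=metadata tank/io_testing   # cache only metadata for this run
  # ...run the fio job...
  zfs set primarycache=all tank/io_testing        # restore the default afterwards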
Fio works by laying out its test files and then doing I/O to them.
ZFS uses a variable block size for files, with the largest block it
will use determined by the recordsize property, which defaults to 128K.
So fio lays out a fairly large file (it would be interesting to see how
large the test files are in your case; ls -lah please), ZFS uses the
largest blocks it can (128K by default), and then fio does I/O using
whatever blocksize you specified. If you specify something smaller than
the ZFS recordsize you get I/O amplification (for reads) and
read/modify/write (for writes). The car analogy would be doing 60 mph
in second gear: your car will probably do it, but the engine will be
working really hard to get you those 60 mph.
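
You can see what block size ZFS actually used for the laid-out file,
because stat reports it as the file's I/O block size. A quick check
with GNU coreutils stat, using the path from your job file:

  # On a ZFS file bigger than one record, the reported block size is the record size in use.
  stat /zfs/io_testing/fio_testfile   # look at the "IO Block:" field; 131072 means 128K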
Again: when it comes to testing, size matters. It looks like some of
your testing wasn't really hitting the disks at all; it was being done
entirely in memory, which means those results are really measuring
memory bandwidth. When you set primarycache=metadata or none you forced
the system to go to the disks. (The 128K sequential read tests were
getting 280k IOPs; that's 35GB/sec, which is memory speed.) The
sequential write tests were getting 4GB/sec, which is faster than a
6-drive RAID 10 of 15K SAS disks can go. Actual write performance of
that array would be somewhere around the 800MB/sec range. Read
performance could be higher: if you had two readers doing sequential
reads you could in theory get 1600MB/sec.
OpenZFS dedup requires quite a bit of RAM to use effectively. In your
testbed configuration you had 32GB of RAM and 1.4TB of storage; you'd
be in good shape in that scenario. However, even 3TB of storage could
result in the dedup tables spilling out of RAM onto your spinning
disks, which would tank performance.
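
If you keep experimenting with dedup, it's worth watching how big the
dedup table gets as the pool fills; the usual rule of thumb is a few
hundred bytes of RAM per DDT entry. A quick look, assuming a pool named
tank:

  zpool status -D tank   # DDT histogram and total entry counts
  zdb -DD tank           # more detail, including on-disk and in-core size per entry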
So, some explicit tips:

Make sure you zfs set recordsize=<blocksize of fio test> on your
dataset before laying out the files for the test. Setting bs=8k in fio
doesn't make the test files 8K-block files; unless you change
recordsize, ZFS will still lay them out with 128K blocks.
Size your test so it's 2x system RAM. For tests where you are using 16
test files on a system with 32GB of RAM, size=4G would be appropriate
(16 x 4G = 64G, twice the RAM).
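
Putting those two tips together, here is roughly what I'd run before
the next 4K test. The dataset name is assumed, and note that recordsize
only applies to blocks written after you change it, so the existing
test file has to go:

  # Assumed dataset backing /zfs/io_testing
  zfs set recordsize=4k tank/io_testing
  rm -f /zfs/io_testing/fio_testfile   # otherwise the old file keeps its 128K blocks
  # then set size=4g in the [global] section of the job file before re-running fio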
Your RAID controller and disk subsystem can do around 800MB/sec writes
and 1600MB/sec reads if it is sequential I/O that isn't seeking the
disks. Numbers much faster than that are being affected by
cache, which is fine; normal ZFS filesystems are accelerated by cache.
Very rarely are production systems just going to see raw disk
performance. ARC hit rates of 90% or greater are common. Small bursty
async writes go to RAM and aren't limited by disk performance at all.
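
If you're curious what a workload is actually getting out of the ARC,
the hit ratio is easy to pull from the kernel stats. A rough sketch for
ZFS on Linux (FreeBSD exposes the same counters under the
kstat.zfs.misc.arcstats sysctls, and the arc_summary tool will give you
the same numbers with less typing):

  # Cumulative-since-boot ARC hit ratio from /proc/spl/kstat/zfs/arcstats
  awk '/^hits / {h=$3} /^misses / {m=$3} END {printf "ARC hit ratio: %.1f%%\n", 100*h/(h+m)}' \
      /proc/spl/kstat/zfs/arcstats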
As soon as you start seeking the disks, the math changes: 15K SAS disks
have an average seek time of about 4ms. That's 250 seeks per second.
You have 6 disks, and 250 * 6 = 1500, so 1500 read IOPs is the upper
limit for random reads. Anything more than that is hitting cache
somewhere: the disks have cache, the RAID controller has cache, and ZFS
has cache. For writes it's worse; a 6-disk RAID 10 only gives you 3
mirror pairs' worth of write IOPs, so about 750 IOPs.
So say you are doing a 4K test, 100% random read, and you get that 1500
IOPs (I/Os per second): 1500 IOPs * 4KB = 6000KB/s, or ~6MB/sec. Pretty
miserable! Thank goodness for SSDs.
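
A good sanity check for any of these numbers is to watch the pool and
the raw devices while the test is running; if fio reports tens of
thousands of IOPs but the disks are only doing a few hundred, you're
measuring cache. Assuming a pool named tank:

  zpool iostat -v tank 1   # per-vdev operations and bandwidth, once a second
  iostat -x 1              # (sysstat) per-device r/s and w/s for comparison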
Thanks,
Josh Paetzel