I originally posted this on the FreeNAS forums, where it was suggested that I raise it with OpenZFS, so I'm raising it here in the hope that you can take a look and help fix it.
I have a problem with my L2ARC: reads from it appear to be capped at ~175MB/s.
The cache device is a 2TB Intel SSD (a P3520, PCIe-based NVMe drive).
If data is read from ARC, I get line speed on the 10Gbps interface (about 1.1GB/s via NFS).
When data is read from L2ARC, I only get 250-300MB/s, and even that much only because part of the data is also being read from the pool disks.
I have tried this with vfs.zfs.l2arc_noprefetch set to both 0 and 1; the problem is more prominent with it set to 0.
I've done some tests on a lightly used server, running everything 3 times and giving you the middle result of each. (A sketch of how the cache settings were toggled for each case follows below.)
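For reference, the cache settings for the cases below were toggled along these lines (the dataset name is just a placeholder for my test filesystem):

zfs set primarycache=all tank/test        # what may be cached in ARC: all | metadata | none
zfs set secondarycache=all tank/test      # what may be cached in L2ARC: all | metadata | none
sysctl vfs.zfs.l2arc_noprefetch=0         # 0 = let prefetched buffers use the L2ARC as well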
When data is in ARC and the filesystem is set to primarycache=all and secondarycache=all:
# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null
bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 10.6273 s, 901 MB/s
When the filesystem is set to primarycache=none and secondarycache=none:
# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null
bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 44.2833 s, 216 MB/s
When the filesystem is set to primarycache=metadata and secondarycache=none, and the metadata is in ARC:
# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null
bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 25.4434 s, 376 MB/s
When the filesystem is set to primarycache=none and secondarycache=metadata, and the metadata is in L2ARC:
# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null
bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 53.1836 s, 180 MB/s
When the filesystem is set to primarycache=none and secondarycache=all, and both metadata and data are in L2ARC:
# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null
bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 53.8619 s, 178 MB/s
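To double-check that the slower runs really were served from the L2ARC device rather than the pool disks, I believe the L2ARC counters in arcstats can be watched while the dd is running (FreeBSD sysctl names; shown as a sketch):

sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses
sysctl kstat.zfs.misc.arcstats.l2_read_bytes

The hit counter and l2_read_bytes should climb during the secondarycache=all runs and stay flat when the cache is disabled.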
Also, the maximum number of read operations I've ever seen on this cache device is 1.52K. I've seen 3K on spinning disks, so I'm wondering why:
                                              capacity     operations    bandwidth
                                             alloc   free   read  write   read  write
cache                                            -      -      -      -      -      -
  gptid/a3ee0c74-46e1-11e8-86ce-ac1f6b0c24d0  141G  1.68T  1.52K      0   195M      0
Basically, whenever a read stopped being served from ARC and went to the cache disk, I observed about 180MB/s.
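A quick sanity check of those iostat numbers, assuming the reads are issued at the default 128k recordsize:

1.52K reads/s x 128 KiB per read ≈ 195 MB/s

which matches the bandwidth column, so the cache device looks bound at roughly 1.5K read operations per second here.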
I have been doing further tests and came to a disturbing conclusion. Here is how it went:
* I have tested vfs.zfs.arc_average_blocksize at 8k, 64k, 256k and 512k (with reboots between changes), with no noticeable change in behavior.
* I have noticed that although I managed to write to the cache very quickly (even 900MB/s, using large vfs.zfs.l2arc_write_boost and vfs.zfs.l2arc_write_max, as sketched below), I was only ever able to get ~185MB/s on average back out of it (rarely above 190MB/s).
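For completeness, the write tunables were raised roughly like this (values are illustrative, in bytes):

sysctl vfs.zfs.l2arc_write_max=268435456     # max bytes written to L2ARC per feed interval (~256MB here)
sysctl vfs.zfs.l2arc_write_boost=268435456   # extra headroom while the ARC is still warming up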
During those reads the cache device was doing about 1.5k read IOPS, so I thought that maybe there is an IOPS limit somewhere in the code dealing with the cache.
To check this theory, I changed the recordsize on my test filesystem to 1M. Reads served from L2ARC on that filesystem are still limited to ~185MB/s, but now at only 185 read IOPS.
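For reference, the change was applied roughly like this (dataset name is a placeholder); recordsize only affects newly written blocks, so the test file needs to be rewritten (and the L2ARC re-warmed) for it to take effect:

zfs set recordsize=1M tank/test

The arithmetic still points at an operations ceiling rather than a bandwidth one: 185 reads/s x 1 MiB per read ≈ 185 MB/s.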
I then researched the idea that the code dealing with NVMe may be broken somewhere and found this thread:
https://lists.freebsd.org/pipermail/svn-src-head/2016-March/083406.html
(Do read the entire thread - very interesting. However, it seems that this change is not there, so there is no nvme_max_optimal_sectorsize :( )
The most important part of the thread, I think, was this:
Code:
/*
* Intel NVMe controllers have a slow path for I/Os that span a 128KB
- * stripe boundary but ZFS limits ashift, which is derived from
- * d_stripesize, to 13 (8KB) so we limit the stripesize reported to
- * geom(8) to 4KB by default.
- *
- * This may result in a small number of additional I/Os to require
- * splitting in nvme(4), however the NVMe I/O path is very efficient
- * so these additional I/Os will cause very minimal (if any) difference
- * in performance or CPU utilisation.
- */
*However, this pointed me towards checking how the actual nvme driver for those Intel cards behaves, so I've done some tests.*
# nvmecontrol perftest -n 16 -o read -s 512 -t 30 nvme0ns1
Threads: 16 Size: 512 READ Time: 30 IO/s: 289317 MB/s: 141
# nvmecontrol perftest -n 16 -o read -s 4096 -t 30 nvme0ns1
Threads: 16 Size: 4096 READ Time: 30 IO/s: 297781 MB/s: 1163
# nvmecontrol perftest -n 1 -o read -s 4096 -t 30 nvme0ns1
Threads: 1 Size: 4096 READ Time: 30 IO/s: 26887 MB/s: 105
# nvmecontrol perftest -n 1 -o read -s 65536 -t 30 nvme0ns1
Threads: 1 Size: 65536 READ Time: 30 IO/s: 3755 MB/s: 234
# nvmecontrol perftest -n 16 -o read -s 65536 -t 30 nvme0ns1
Threads: 16 Size: 65536 READ Time: 30 IO/s: 42393 MB/s: 2649
# nvmecontrol perftest -n 16 -o read -s 65536 -t 10 nvme0ns1
Threads: 16 Size: 65536 READ Time: 10 IO/s: 36372 MB/s: 2273
# nvmecontrol perftest -n 16 -o read -s 4096 -t 10 nvme0ns1
Threads: 16 Size: 4096 READ Time: 10 IO/s: 306383 MB/s: 1196
# nvmecontrol perftest -n 8 -o read -s 4096 -t 10 nvme0ns1
Threads: 8 Size: 4096 READ Time: 10 IO/s: 327323 MB/s: 1278
# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads: 1 Size: 8192 READ Time: 10 IO/s: 23705 MB/s: 185
Then it struck me - 185MB/s - that's what I keep seeing.
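The arithmetic lines up almost exactly (nvmecontrol's MB/s figure appears to be MiB/s):

23,705 IO/s x 8 KiB per IO ≈ 185 MiB/s

In other words, a single thread issuing 8k reads tops out around 23K IOPS on this device, which is precisely the throughput I keep hitting from L2ARC.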
So I did some more testing:
# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads: 1 Size: 8192 READ Time: 10 IO/s: 22810 MB/s: 178
# nvmecontrol perftest -n 2 -o read -s 8192 -t 10 nvme0ns1
Threads: 2 Size: 8192 READ Time: 10 IO/s: 133169 MB/s: 1040
Wait a minute! A single thread gets 178MB/s, but two threads get almost six times as much throughput?!
So, I'm checking further:
# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads: 1 Size: 8192 READ Time: 10 IO/s: 22761 MB/s: 177
# nvmecontrol perftest -n 2 -o read -s 8192 -t 10 nvme0ns1
Threads: 2 Size: 8192 READ Time: 10 IO/s: 129932 MB/s: 1015
# nvmecontrol perftest -n 4 -o read -s 8192 -t 10 nvme0ns1
Threads: 4 Size: 8192 READ Time: 10 IO/s: 235470 MB/s: 1839
# nvmecontrol perftest -n 8 -o read -s 8192 -t 10 nvme0ns1
Threads: 8 Size: 8192 READ Time: 10 IO/s: 232484 MB/s: 1816
Looks like I hit an IOPS limit on the device. Time to check 64k reads:
# nvmecontrol perftest -n 1 -o read -s 65536 -t 10 nvme0ns1
Threads: 1 Size: 65536 READ Time: 10 IO/s: 3664 MB/s: 229
# nvmecontrol perftest -n 2 -o read -s 65536 -t 10 nvme0ns1
Threads: 2 Size: 65536 READ Time: 10 IO/s: 35995 MB/s: 2249
# nvmecontrol perftest -n 4 -o read -s 65536 -t 10 nvme0ns1
Threads: 4 Size: 65536 READ Time: 10 IO/s: 48641 MB/s: 3040
# nvmecontrol perftest -n 8 -o read -s 65536 -t 10 nvme0ns1
Threads: 8 Size: 65536 READ Time: 10 IO/s: 47784 MB/s: 2986
Wow, multi-threading to the rescue - nearly ten times more throughput with just two threads on 64k chunks!
I have also checked it for 128k:
# nvmecontrol perftest -n 1 -o read -s 131072 -t 10 nvme0ns1
Threads: 1 Size: 131072 READ Time: 10 IO/s: 2276 MB/s: 284
# nvmecontrol perftest -n 2 -o read -s 131072 -t 10 nvme0ns1
Threads: 2 Size: 131072 READ Time: 10 IO/s: 20692 MB/s: 2586
# nvmecontrol perftest -n 4 -o read -s 131072 -t 10 nvme0ns1
Threads: 4 Size: 131072 READ Time: 10 IO/s: 25473 MB/s: 3184
# nvmecontrol perftest -n 8 -o read -s 131072 -t 10 nvme0ns1
Threads: 8 Size: 131072 READ Time: 10 IO/s: 23149 MB/s: 2893
Same behavior observed.
And for 256k - though here I got something I expected:
# nvmecontrol perftest -n 1 -o read -s 262144 -t 1 nvme0ns1
Threads: 1 Size: 262144 READ Time: 1 IO/s: 1 MB/s: 0
Zero. It can't be read this way - this seems to be Intel hardware related.
Now, I've done some dd reads from my Intel SSD, just to see how it behaves via the nvd driver.
# dd if=/dev/nvd0 of=/dev/null bs=8192 count=10000
10000+0 records in
10000+0 records out
81920000 bytes transferred in 0.171515 secs (477624808 bytes/sec)
# dd if=/dev/nvd0 of=/dev/null bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes transferred in 1.162961 secs (563527174 bytes/sec)
# dd if=/dev/nvd0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes transferred in 2.283765 secs (573929503 bytes/sec)
# dd if=/dev/nvd0 of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 18.330676 secs (572033472 bytes/sec)
For some reason it hits a ceiling of ~600MB/s. Not perfect, but I would be far happier with that than with the 185MB/s I actually get.
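As a crude check of whether that ~600MB/s is just a single-stream limit, several dd readers can be run against different parts of the device at once (block counts and offsets below are illustrative):

for i in 0 1 2 3; do
  dd if=/dev/nvd0 of=/dev/null bs=128k count=20000 skip=$((i * 20000)) &
done
wait

If the device behaves the way the perftest numbers above suggest, the combined throughput of the four streams should land well above what a single dd manages.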
I have done one more test, based on the presumption that if my recordsize is 128k, there is a chance that the actual reads from L2ARC are chopped into smaller chunks, which may add some metadata overhead.
So I changed the recordsize to 64k, re-ran my test to fill up the L2ARC, and continued with my read test.
At times I saw the throughput from the cache reach 287MB/s:
cache                                            -      -      -      -      -      -
  gptid/a3ee0c74-46e1-11e8-86ce-ac1f6b0c24d0  471G  1.36T  4.48K     14   287M  13.1M
--------------------------------------------  -----  -----  -----  -----  -----  -----
This roughly aligns with the single-threaded 128k reads from the perftest results above.
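The arithmetic is again consistent with an operations-bound, single-stream read path:

4.48K reads/s x 64 KiB per read ≈ 290 MB/s

which matches the ~287M bandwidth column above.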
*The question is: is the L2ARC read process single-threaded?*
How can we check this? Can it be checked on your side?
And is it possible to use more than one thread (minimum 2, preferably 4-8)?
(It would be nice to have a sysctl variable for it.)
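One way to approximate an answer from userland (assuming fio is available; the device node, block size and job counts are illustrative) is to compare a single reader against several readers hitting the raw cache device, which mimics what a single-threaded versus multi-threaded L2ARC read path would look like:

fio --name=qd1 --filename=/dev/nvd0 --ioengine=psync --rw=randread --bs=8k \
    --numjobs=1 --runtime=10 --time_based --group_reporting
fio --name=qd8 --filename=/dev/nvd0 --ioengine=psync --rw=randread --bs=8k \
    --numjobs=8 --runtime=10 --time_based --group_reporting

If the 8-job run scales the way the nvmecontrol results above do, that would support the idea that the limit is how many concurrent reads ZFS issues to the cache vdev, not the device itself.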
Also, I tested this behavior on another Intel NVMe disk (a P4500) and observed a similar limitation.
Additionally, one of the FreeNAS devs tested it on his HGST SSD and saw a similarly large difference.
Can it be addressed within OpenZFS or would this need to go to FreeBSD?
Regards,
Wojciech Kruzel