I posted this on the FreeNAS forums first, and it was suggested that I raise it with OpenZFS, so I'm raising it here in the hope that you can have a look and help fix it.
I have a problem with my L2ARC.


It seems to have reads capped at ~175MB/s.
It's a 2TB Intel P3520 SSD, a PCIe-based NVMe drive.

The problem is that if data is read from the ARC, I get line speed on the 10Gbps 
interface (about 1.1GB/s via NFS).
When data is read from the L2ARC, I get between 250-300MB/s, and only that much 
because it also partially reads from the spinning disks.

I have tried it with vfs.zfs.l2arc_noprefetch set to both 0 and 1, although the 
problem is more prominent with it set to 0.
I've done some tests, ran everything 3 times, and am giving you the middle 
results, taken on a lightly used server.
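
For reference, the per-test setup was done with commands along these lines - just a sketch, where tank/test stands in for my actual dataset name and the values change per test case:

# sysctl vfs.zfs.l2arc_noprefetch=0        # or =1 for the second set of runs
# zfs set primarycache=all tank/test       # all / metadata / none, per test case below
# zfs set secondarycache=all tank/test     # all / metadata / none, per test case below
# zfs get primarycache,secondarycache tank/test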
When the data is in the ARC and the filesystem is set to primarycache=all and secondarycache=all:

# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 10.6273 s, 901 MB/s




When filesystem is set to primarycache=none and secondarycache=none:


# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 44.2833 s, 216 MB/s




When filesystem is set to primarycache=metadata and secondarycache=none and 
metadata is in ARC:


# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 25.4434 s, 376 MB/s




When filesystem is set to primarycache=none and secondarycache=metadata and 
metadata is in L2ARC:


# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 53.1836 s, 180 MB/s




When filesystem is set to primarycache=none and secondarycache=all and metadata 
and data are in L2ARC:


# echo 3 > /proc/sys/vm/drop_caches;dd if=/test/randomfile.bin of=/dev/null bs=1M
9129+1 records in
9129+1 records out
9573093376 bytes (9.6 GB, 8.9 GiB) copied, 53.8619 s, 178 MB/s

Also, the maximum number of read operations I've ever seen on this cache device is 1.52K.
I've seen 3K on the spinning disks, so I'm just wondering why that is.

cache                                       -      -      -      -      -      -
  gptid/a3ee0c74-46e1-11e8-86ce-ac1f6b0c24d0   141G  1.68T  1.52K      0   195M      0
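
(That snippet comes from watching the pool with something like this - the pool name tank is just a placeholder:)

# zpool iostat -v tank 1    # per-second ops and bandwidth, including the cache device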


Basically, whenever the read stopped being served from the ARC and went to the cache 
disk, I observed about 180MB/s.


I have been doing further tests and came to a disturbing conclusion.
As it went:
 * I have tested vfs.zfs.arc_average_blocksize at 8k, 64k, 256k and 512k (with 
reboots), with no noticeable change in behavior.
 * I have noticed that although I managed to write to the cache very fast (even 
900MB/s, using large vfs.zfs.l2arc_write_boost and vfs.zfs.l2arc_write_max - see 
the sketch after this list), I only ever got ~185MB/s on average back out of it 
(rarely above 190MB/s).
It was doing about 1.5k read IOPS, so I thought that maybe there is an IOPS limit 
somewhere in the code dealing with the cache.
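
For reference, the L2ARC write tunables can be bumped at runtime with sysctl - the values below are only examples, not necessarily the ones I used (vfs.zfs.arc_average_blocksize, as far as I can tell, is only settable at boot, hence the reboots):

# sysctl vfs.zfs.l2arc_write_max=268435456     # example: allow 256MB per feed interval
# sysctl vfs.zfs.l2arc_write_boost=268435456   # example: extra headroom while the ARC is still cold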
To check the IOPS-limit theory, I changed the recordsize on my test filesystem to 1M. 
Reads served from the L2ARC on this filesystem were then still limited to 185MB/s, 
but with only 185 read IOPS.
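
(For reference that change was simply the following - dataset name again illustrative; note that recordsize only applies to newly written files:)

# zfs set recordsize=1M tank/test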
I then researched the idea that the code dealing with NVMe may be broken 
somewhere and found this thread:
https://lists.freebsd.org/pipermail/svn-src-head/2016-March/083406.html
(Do read the entire thread - very interesting; however, it seems that this 
change is not there, so there is no nvme_max_optimal_sectorsize.) :(

However, the most important part of that thread, I think, was this comment:
Code:
/*
 * Intel NVMe controllers have a slow path for I/Os that span a 128KB
 * stripe boundary but ZFS limits ashift, which is derived from
 * d_stripesize, to 13 (8KB) so we limit the stripesize reported to
 * geom(8) to 4KB by default.
 *
 * This may result in a small number of additional I/Os to require
 * splitting in nvme(4), however the NVMe I/O path is very efficient
 * so these additional I/Os will cause very minimal (if any) difference
 * in performance or CPU utilisation.
 */

*However, this pointed me towards checking how the actual nvme driver for those 
Intel cards behaves, so I've done some tests.*


# nvmecontrol perftest -n 16 -o read -s 512 -t 30 nvme0ns1
Threads: 16 Size:    512  READ Time:  30 IO/s:  289317 MB/s:  141
# nvmecontrol perftest -n 16 -o read -s 4096 -t 30 nvme0ns1
Threads: 16 Size:   4096  READ Time:  30 IO/s:  297781 MB/s: 1163
# nvmecontrol perftest -n 1 -o read -s 4096 -t 30 nvme0ns1
Threads:  1 Size:   4096  READ Time:  30 IO/s:   26887 MB/s:  105
# nvmecontrol perftest -n 1 -o read -s 65536 -t 30 nvme0ns1
Threads:  1 Size:  65536  READ Time:  30 IO/s:    3755 MB/s:  234
# nvmecontrol perftest -n 16 -o read -s 65536 -t 30 nvme0ns1
Threads: 16 Size:  65536  READ Time:  30 IO/s:   42393 MB/s: 2649
# nvmecontrol perftest -n 16 -o read -s 65536 -t 10 nvme0ns1
Threads: 16 Size:  65536  READ Time:  10 IO/s:   36372 MB/s: 2273
# nvmecontrol perftest -n 16 -o read -s 4096 -t 10 nvme0ns1
Threads: 16 Size:   4096  READ Time:  10 IO/s:  306383 MB/s: 1196
# nvmecontrol perftest -n 8 -o read -s 4096 -t 10 nvme0ns1
Threads:  8 Size:   4096  READ Time:  10 IO/s:  327323 MB/s: 1278
# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads:  1 Size:   8192  READ Time:  10 IO/s:   23705 MB/s:  185

Then it struck me - 185MB/s - that's what I keep seeing.


So I did some more testing:


# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads:  1 Size:   8192  READ Time:  10 IO/s:   22810 MB/s:  178
# nvmecontrol perftest -n 2 -o read -s 8192 -t 10 nvme0ns1
Threads:  2 Size:   8192  READ Time:  10 IO/s:  133169 MB/s: 1040


Wait a minute! A single thread gets 178MB/s but two threads get nearly six times 
the throughput?!
So, I'm checking further:
# nvmecontrol perftest -n 1 -o read -s 8192 -t 10 nvme0ns1
Threads:  1 Size:   8192  READ Time:  10 IO/s:   22761 MB/s:  177
# nvmecontrol perftest -n 2 -o read -s 8192 -t 10 nvme0ns1
Threads:  2 Size:   8192  READ Time:  10 IO/s:  129932 MB/s: 1015
# nvmecontrol perftest -n 4 -o read -s 8192 -t 10 nvme0ns1
Threads:  4 Size:   8192  READ Time:  10 IO/s:  235470 MB/s: 1839
# nvmecontrol perftest -n 8 -o read -s 8192 -t 10 nvme0ns1
Threads:  8 Size:   8192  READ Time:  10 IO/s:  232484 MB/s: 1816

Looks like I hit an IOPS limit on the device. Time to check 64k reads:


# nvmecontrol perftest -n 1 -o read -s 65536 -t 10 nvme0ns1
Threads:  1 Size:  65536  READ Time:  10 IO/s:    3664 MB/s:  229
# nvmecontrol perftest -n 2 -o read -s 65536 -t 10 nvme0ns1
Threads:  2 Size:  65536  READ Time:  10 IO/s:   35995 MB/s: 2249
# nvmecontrol perftest -n 4 -o read -s 65536 -t 10 nvme0ns1
Threads:  4 Size:  65536  READ Time:  10 IO/s:   48641 MB/s: 3040
# nvmecontrol perftest -n 8 -o read -s 65536 -t 10 nvme0ns1
Threads:  8 Size:  65536  READ Time:  10 IO/s:   47784 MB/s: 2986

Wow, multi-threading to the rescue - nearly ten times the throughput with just 
two threads, on 64k chunks!


I have also checked it for 128k:


# nvmecontrol perftest -n 1 -o read -s 131072 -t 10 nvme0ns1
Threads:  1 Size: 131072  READ Time:  10 IO/s:    2276 MB/s:  284
# nvmecontrol perftest -n 2 -o read -s 131072 -t 10 nvme0ns1
Threads:  2 Size: 131072  READ Time:  10 IO/s:   20692 MB/s: 2586
# nvmecontrol perftest -n 4 -o read -s 131072 -t 10 nvme0ns1
Threads:  4 Size: 131072  READ Time:  10 IO/s:   25473 MB/s: 3184
# nvmecontrol perftest -n 8 -o read -s 131072 -t 10 nvme0ns1
Threads:  8 Size: 131072  READ Time:  10 IO/s:   23149 MB/s: 2893

The same behavior was observed.
And for 256k - though here I got something I expected:

# nvmecontrol perftest -n 1 -o read -s 262144 -t 1 nvme0ns1
Threads:  1 Size: 262144  READ Time:   1 IO/s:       1 MB/s:    0



Zero - it can't be read this way. This is Intel hardware related.



Now, I've done some dd reads from my Intel SSD, just to see how it behaves via 
the nvd driver.


# dd if=/dev/nvd0 of=/dev/null bs=8192 count=10000
10000+0 records in
10000+0 records out
81920000 bytes transferred in 0.171515 secs (477624808 bytes/sec)
# dd if=/dev/nvd0 of=/dev/null bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes transferred in 1.162961 secs (563527174 bytes/sec)
# dd if=/dev/nvd0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes transferred in 2.283765 secs (573929503 bytes/sec)
# dd if=/dev/nvd0 of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 18.330676 secs (572033472 bytes/sec)

For some reason it hits a ceiling of ~600MB/s. Not perfect, but I would be much 
happier with that than with the 185MB/s I actually get.
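
A quick way to check whether that ~570MB/s ceiling is per-stream - just a sketch, with arbitrary counts and offsets, not something I'm presenting results for - would be to run two dd readers in parallel over non-overlapping regions:

# dd if=/dev/nvd0 of=/dev/null bs=128k count=20000 &
# dd if=/dev/nvd0 of=/dev/null bs=128k count=20000 skip=40000 &   # skip past the first reader's region
# wait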


I have done one more test, based on the presumption that, with a 128k recordsize, 
the actual reads from the L2ARC may be chopped into smaller chunks, since the 
metadata may add some overhead.


So I changed the recordsize to 64k, re-ran the test that fills up the L2ARC, and 
continued with my read test.
At moments I saw the throughput from the cache reach 287MB/s:

cache                                       -      -      -      -      -      -
  gptid/a3ee0c74-46e1-11e8-86ce-ac1f6b0c24d0   471G  1.36T  4.48K     14   287M  13.1M
--------------------------------------  -----  -----  -----  -----  -----  -----


This roughly aligns with the single-threaded 128k reads from the nvmecontrol 
testing above (for a single thread).


*The question is: is the L2ARC read process single-threaded?*


How can we check this? Is it something you could check on your side?
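
One rough idea from my side (an assumption, I haven't verified it would be conclusive): watch the queue depth on the cache device with gstat while a read is served purely from L2ARC - if the L(q) column sits around 1, only a single I/O is outstanding at a time, which would fit a single-threaded reader:

# gstat -f nvd0    # L(q) = outstanding requests on the device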


And is it possible to use more than one thread (minimum 2, preferably 4-8)?
(It would be nice to have a sysctl variable for this.)

Also, I tested this behavior on another Intel NVMe disk (P4500) and observed a 
similar limitation.
Additionally, one of the FreeNAS devs tested it on his HGST SSD and saw a 
similarly large difference.

Can it be addressed within OpenZFS or would this need to go to FreeBSD?

Regards,
Wojciech Kruzel
