Richard Elling via illumos-discuss wrote:
Fair enough. What would be the best way to measure the quantity of data 
actually written to the device?

Most commonly used is:
        iostat -x
or
        iostat -xn


How about dtrace? I'm new to dtrace, so I'm not as proficient in it as I should be and may have made a mistake.

root@storage0:/root# dd if=/testpool/randomfile.deleteme of=/testpool/newrandom.deleteme bs=1M oflag=sync count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.239804 s, 437 MB/s


During that run, from a separate shell, I ran:
root@storage0:/root# dtrace -n ::zil_lwb_commit:entry'{@[1]=quantize((lwb_t*)args[2]->lwb_sz);}'
dtrace: description '::zil_lwb_commit:entry' matched 1 probe
^C

        1
           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1601
          262144 |                                         0


This seems to say that 1601 128KiB blocks went to the ZIL during the test, roughly 200MiB for the 100MiB that dd wrote. That agrees with the allocations, but not with iostat. It almost seems as if iostat is leaving something out and there is more activity (seeks, maybe?) going to the slog than just writes.

Would it be a correct interpretation that, because a 128KiB block can't hold 128KiB of data plus a 4KiB checksum, it falls back to 64KiB of data plus the 4KiB checksum, while still allocating a full 128KiB block to accommodate that 68KiB?
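One way I could check that directly, rather than inferring it from iostat, would be to look at how full each lwb actually is when it gets issued. This is only a sketch, and it assumes the build still has zil_lwb_write_start(zilog_t *, lwb_t *) and that lwb_t carries the lwb_sz and lwb_nused fields (names taken from zil_impl.h, so worth double-checking against the source):

#!/usr/sbin/dtrace -s
/*
 * Sketch only: per-lwb bytes allocated vs. bytes actually used, keyed
 * off zil_lwb_write_start(), which should fire once per lwb as it is
 * issued. Field names lwb_sz/lwb_nused assumed from zil_impl.h.
 */
fbt::zil_lwb_write_start:entry
{
        @alloc["lwb bytes allocated"] = sum(args[1]->lwb_sz);
        @used["lwb bytes used"]       = sum(args[1]->lwb_nused);
        @fill["lwb bytes used (distribution)"] = quantize(args[1]->lwb_nused);
}

If each 128KiB lwb is only carrying something like 68KiB of payload, the used/allocated totals should show it immediately.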

The following test would seem to bear that out:

root@storage0:/root# dd if=/testpool/randomfile.deleteme of=/testpool/newrandom.deleteme bs=124K oflag=sync count=800
800+0 records in
800+0 records out
101580800 bytes (102 MB) copied, 0.436327 s, 233 MB/s

"iostat -Inx 30" during that interval:

    r/i    w/i   kr/i   kw/i wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 rpool
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
    0.0  400.0    0.0 51200.0  0.0  0.0    0.0    0.3   0   0 c5t0d0
    0.0  400.0    0.0 51200.0  0.0  0.0    0.0    0.3   0   0 c5t1d0
    0.0 1022.0    0.0 103072.0  0.0  0.0    0.0    0.9   0   3 c1t5000CCA05C68D505d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t5000CCA05C681EB9d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t5000CCA03B1007E5d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t5000CCA03B10C085d0
    0.0 1816.0    0.0 205472.0  5.1  0.0   84.4    0.6   3   3 testpool


This time, with 124KiB of data and a 4KiB checksum, we're getting nice 128KiB blocks written to the slog. Interestingly, the allocations shown for the slog by 'zpool iostat -v testpool' now peak at exactly 100MiB (800 records x 128KiB per block) instead of 200MiB.

In spite of that, dtrace still shows a different story from that same test:

root@storage0:/root# dtrace -n ::zil_lwb_commit:entry'{@[1]=quantize((lwb_t*)args[2]->lwb_sz);}'
dtrace: description '::zil_lwb_commit:entry' matched 1 probe
^C

        1
           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1551
          262144 |                                         0

dtrace shows a total of 198528KiB (1551 x 128KiB) being sent to the ZIL, which is 2x the dd output plus 128KiB. Divided by the time shown by dd, that puts us nearly bang on the manufacturer's write specification of 465MB/sec (note, MB rather than MiB, since manufacturers use SI units). dd, of course, only reports on the data it was writing, hence the 233MB/sec, give or take some rounding.


So, this raises another question: how is that data being sent to the slog? Are we seeking through the allocated but unused space?
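To answer that from the device side instead of the ZIL side, the io provider seems like the right tool. Again just a sketch: the statnames have to be matched against iostat -x output to pick out the slog, and the very first sample per device will show one meaningless jump.

#!/usr/sbin/dtrace -s
/*
 * Sketch: for every write that reaches a device, record its size and
 * the LBA distance from the previous write to that same device. Large
 * jumps on the slog would suggest we really are skipping over
 * allocated-but-unused space.
 */
io:::start
/!(args[0]->b_flags & B_READ)/
{
        @size[args[1]->dev_statname] = quantize(args[0]->b_bcount);
        @jump[args[1]->dev_statname] =
            quantize((int64_t)(args[0]->b_blkno - last[args[1]->dev_statname]));
        last[args[1]->dev_statname] = args[0]->b_blkno;
}

Running that alongside the dd should show whether the slog is really seeing 128KiB writes back to back, or something else.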

Seeks are supposed to be nearly instantaneous on SSDs, unless there is some kind of unreported zero fill or null-byte padding being used to accomplish the seek. I seem to recall that back in the '90s it was common practice to write null bytes rather than issue an actual seek command, because it was lower latency on rotational disks. That trick probably buys nothing on SSDs.

This seems like a better-fitting explanation for the performance numbers I'm getting (about half of expected) than the old standby of "slogs are qd=1, so they are always much slower than manufacturer specs." Samsung actually quotes qd=1 write performance numbers for these enterprise SSDs, so I have a fair reference, and my testing with the raw device supports their numbers. I've gone so far as to turn off cache flushes for these devices in /kernel/drv/sd.conf for this testing, to make best use of their 512MiB power-protected cache, and I've deliberately kept test sizes smaller than the on-drive cache to take advantage of it. Essentially, the bottleneck is the CPU on the SSD.
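For reference, the sd.conf change was along these lines. The inquiry string below is a placeholder rather than the exact string my drives report, and it assumes the cache-nonvolatile tunable in the illumos sd driver; the vendor/product must be matched against what the drive actually returns (vendor ID padded to 8 characters) before using anything like it.

# /kernel/drv/sd.conf (excerpt) -- placeholder inquiry string, for
# illustration only. Tells sd the write cache is non-volatile so it can
# skip SYNCHRONIZE CACHE for these devices.
sd-config-list =
        "ATA     SAMSUNG PLACEHOLDER", "cache-nonvolatile:true";

It takes effect on the next attach of the device, so plan on a reboot.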

Sincerely,
Andrew Kinney

