Richard Elling via illumos-discuss wrote:
>> Fair enough. What would be the best way to measure the quantity of data
>> actually written to the device?
>
> Most commonly used is:
>     iostat -x
> or
>     iostat -xn
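Noted; with an interval argument the samples line up with a test run, and kw/s is the column to compare against the volume of data written, e.g.:

iostat -xn 1     # extended device stats, logical names, one-second samples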
How about DTrace? I'm new to it, so I'm not as proficient as I should be
and may have made a mistake.
root@storage0:/root# dd if=/testpool/randomfile.deleteme of=/testpool/newrandom.deleteme bs=1M oflag=sync count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.239804 s, 437 MB/s
During that period, from a separate shell, I ran:
root@storage0:/root# dtrace -n ::zil_lwb_commit:entry'{@[1]=quantize(((lwb_t *)args[2])->lwb_sz);}'
dtrace: description '::zil_lwb_commit:entry' matched 1 probe
^C
1
value ------------- Distribution ------------- count
65536 | 0
131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1601
262144 | 0
This seems to say that 1601 128KiB blocks went to the ZIL during the test.
That agrees with the allocations, but not with iostat. It almost seems as
if iostat is leaving something out and there is more activity (seeks,
maybe?) going to the slog than just the writes.
Would it be a correct interpretation that, because a 128KiB block can't
fit 128KiB of data plus a 4KiB checksum, it falls back to 64KiB of data
plus the 4KiB checksum, while still allocating 128KiB to accommodate that
68KiB?
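As a sanity check on that hypothesis: at 64KiB of payload per block, dd's
104857600 bytes would need exactly 1600 blocks (DTrace counted 1601,
plausibly one extra for the final commit), and 1601 blocks at 128KiB each
is almost exactly the 200MiB of allocations:

echo "104857600 / 65536" | bc     # payload blocks at 64KiB each -> 1600
echo "1601 * 128 / 1024" | bc     # MiB allocated at 128KiB per block -> 200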
The following test would seem to bear that out:
root@storage0:/root# dd if=/testpool/randomfile.deleteme of=/testpool/newrandom.deleteme bs=124K oflag=sync count=800
800+0 records in
800+0 records out
101580800 bytes (102 MB) copied, 0.436327 s, 233 MB/s
"iostat -Inx 30" during that interval:
 r/i     w/i    kr/i     kw/i  wait  actv wsvc_t asvc_t  %w  %b device
 0.0     0.0     0.0      0.0   0.0   0.0    0.0    0.0   0   0 rpool
 0.0     0.0     0.0      0.0   0.0   0.0    0.0    0.0   0   0 c4t0d0
 0.0   400.0     0.0  51200.0   0.0   0.0    0.0    0.3   0   0 c5t0d0
 0.0   400.0     0.0  51200.0   0.0   0.0    0.0    0.3   0   0 c5t1d0
 0.0  1022.0     0.0 103072.0   0.0   0.0    0.0    0.9   0   3 c1t5000CCA05C68D505d0
 0.0     0.0     0.0      0.0   0.0   0.0    0.0    0.0   0   0 c1t5000CCA05C681EB9d0
 0.0     0.0     0.0      0.0   0.0   0.0    0.0    0.0   0   0 c1t5000CCA03B1007E5d0
 0.0     0.0     0.0      0.0   0.0   0.0    0.0    0.0   0   0 c1t5000CCA03B10C085d0
 0.0  1816.0     0.0 205472.0   5.1   0.0   84.4    0.6   3   3 testpool
This time, with 124KiB of data plus a 4KiB checksum per block, we're
getting clean 128KiB blocks written to the slog. Interestingly,
allocations shown for the slog by 'zpool iostat -v testpool' now peak at
exactly 100MiB instead of 200MiB.
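That 128KiB block size falls straight out of the iostat sample above: each
of the two devices taking the log writes shows 400 writes totaling 51200
KiB in the interval:

echo "51200 / 400" | bc     # KiB per write operation -> 128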
In spite of that, DTrace still tells a different story for that same test:
root@storage0:/root# dtrace -n ::zil_lwb_commit:entry'{@[1]=quantize(((lwb_t *)args[2])->lwb_sz);}'
dtrace: description '::zil_lwb_commit:entry' matched 1 probe
^C
1
value ------------- Distribution ------------- count
65536 | 0
131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1551
262144 | 0
DTrace shows a total of 198528KiB being sent to the ZIL (2x the dd output
plus 128KiB), which, divided by the elapsed time reported by dd, puts us
nearly bang on the manufacturer's write specification of 465MB/sec (note:
MB, not MiB, since manufacturers use SI units). dd, of course, only
reports the data it was writing, hence the 233MB/sec, allowing for some
rounding.
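Spelling out that arithmetic:

echo "1551 * 128" | bc                               # KiB sent to the ZIL -> 198528
echo "198528 * 1024 / 0.436327 / 1000000" | bc -l    # SI MB/s over dd's elapsed time -> ~465.9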
That raises another question: how is the extra data being sent to the
slog? Are we seeking through the allocated but unused space? Seeks are
supposed to be nearly instantaneous on SSDs, unless there is some kind of
unreported zero fill or null-byte padding being used to accomplish the
seek. I seem to recall that back in the '90s there was a widely used
practice of sending null bytes instead of issuing an actual seek command,
because it was lower latency on rotational disks. That assumption is
probably not true of SSDs.
This seems a better-fitting explanation for the performance numbers I'm
getting (about half of expected) than the old standby of "slogs are qd=1,
so they are always much slower than manufacturer specs." Samsung actually
quotes qd=1 write performance numbers for these enterprise SSDs, so I
have a fair reference, and my testing against the raw device supports
their numbers. I've gone so far as to turn off cache flushes for these
devices in /kernel/drv/sd.conf for this testing, to make best use of
their 512MiB power-protected cache, and I've deliberately kept test sizes
smaller than the on-drive cache to take advantage of it. Essentially, the
bottleneck is the CPU on the SSD.
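For reference, the sd.conf change I mean is an sd-config-list entry
marking the cache non-volatile, which lets the sd driver skip cache-flush
commands for the device. The vendor/product string below is a placeholder;
it has to match the drive's actual inquiry data (vendor ID padded to 8
characters):

# /kernel/drv/sd.conf -- placeholder VID/PID, adjust to match the drive
sd-config-list = "ATA     Samsung SSD 845", "cache-nonvolatile:true";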
Sincerely,
Andrew Kinney