Richard Elling via illumos-discuss wrote:
>> Fair enough. What would be the best way to measure the quantity of data
>> actually written to the device?
>
> Most commonly used is:
>     iostat -x
> or
>     iostat -xn
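Noted; with an interval argument the samples line up with a test run, and kw/s is the column to compare against the volume of data written, e.g.:

iostat -xn 1     # extended device stats, logical names, one-second samples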
How about DTrace? I'm new to it, so I'm not as proficient as I should be
and may have made a mistake.
root@storage0:/root# dd if=/testpool/randomfile.deleteme of=/testpool/newrandom.deleteme bs=1M oflag=sync count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.239804 s, 437 MB/s
During that period, from a separate shell, I ran:
root@storage0:/root# dtrace -n ::zil_lwb_commit:entry'{@[1]=quantize(((lwb_t *)args[2])->lwb_sz);}'
dtrace: description '::zil_lwb_commit:entry' matched 1 probe
^C
1
value ------------- Distribution ------------- count
65536 | 0
131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1601
262144 | 0
This seems to say that 1601 128KiB blocks went to the ZIL during the test.
That agrees with the allocations, but not with iostat. It almost seems as
if iostat is leaving something out and there is more activity (seeks,
maybe?) going to the slog than just the writes.
Would it be a correct interpretation that, because a 128KiB block can't
fit 128KiB of data plus a 4KiB checksum, it falls back to 64KiB of data
plus the 4KiB checksum, while still allocating 128KiB to accommodate that
68KiB?
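As a sanity check on that hypothesis: at 64KiB of payload per block, dd's
104857600 bytes would need exactly 1600 blocks (DTrace counted 1601,
plausibly one extra for the final commit), and 1601 blocks at 128KiB each
is almost exactly the 200MiB of allocations:

echo "104857600 / 65536" | bc     # payload blocks at 64KiB each -> 1600
echo "1601 * 128 / 1024" | bc     # MiB allocated at 128KiB per block -> 200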
The following test would seem to bear that out:
root@storage0:/root# dd if=/testpool/randomfile.deleteme of=/testpool/newrandom.deleteme bs=124K oflag=sync count=800
800+0 records in
800+0 records out
101580800 bytes (102 MB) copied, 0.436327 s, 233 MB/s
"iostat -Inx 30" during that interval:
 r/i     w/i    kr/i     kw/i  wait  actv wsvc_t asvc_t  %w  %b device
 0.0     0.0     0.0      0.0   0.0   0.0    0.0    0.0   0   0 rpool
 0.0     0.0     0.0      0.0   0.0   0.0    0.0    0.0   0   0 c4t0d0
 0.0   400.0     0.0  51200.0   0.0   0.0    0.0    0.3   0   0 c5t0d0
 0.0   400.0     0.0  51200.0   0.0   0.0    0.0    0.3   0   0 c5t1d0
 0.0  1022.0     0.0 103072.0   0.0   0.0    0.0    0.9   0   3 c1t5000CCA05C68D505d0
 0.0     0.0     0.0      0.0   0.0   0.0    0.0    0.0   0   0 c1t5000CCA05C681EB9d0
 0.0     0.0     0.0      0.0   0.0   0.0    0.0    0.0   0   0 c1t5000CCA03B1007E5d0
 0.0     0.0     0.0      0.0   0.0   0.0    0.0    0.0   0   0 c1t5000CCA03B10C085d0
 0.0  1816.0     0.0 205472.0   5.1   0.0   84.4    0.6   3   3 testpool
This time, with 124KiB of data plus a 4KiB checksum per block, we're
getting clean 128KiB blocks written to the slog. Interestingly,
allocations shown for the slog by 'zpool iostat -v testpool' now peak at
exactly 100MiB instead of 200MiB.
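That 128KiB block size falls straight out of the iostat sample above: each
of the two devices taking the log writes shows 400 writes totaling 51200
KiB in the interval:

echo "51200 / 400" | bc     # KiB per write operation -> 128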
In spite of that, DTrace still tells a different story for that same test:
root@storage0:/root# dtrace -n ::zil_lwb_commit:entry'{@[1]=quantize(((lwb_t *)args[2])->lwb_sz);}'
dtrace: description '::zil_lwb_commit:entry' matched 1 probe
^C
1
value ------------- Distribution ------------- count
65536 | 0
131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1551
262144 | 0
DTrace shows a total of 198528KiB being sent to the ZIL (2x the dd output
plus 128KiB), which, divided by the elapsed time reported by dd, puts us
nearly bang on the manufacturer's write specification of 465MB/sec (note:
MB, not MiB, since manufacturers use SI units). dd, of course, only
reports the data it was writing, hence the 233MB/sec, allowing for some
rounding.
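Spelling out that arithmetic:

echo "1551 * 128" | bc                               # KiB sent to the ZIL -> 198528
echo "198528 * 1024 / 0.436327 / 1000000" | bc -l    # SI MB/s over dd's elapsed time -> ~465.9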
That raises another question: how is the extra data being sent to the
slog? Are we seeking through the allocated but unused space? Seeks are
supposed to be nearly instantaneous on SSDs, unless there is some kind of
unreported zero fill or null-byte padding being used to accomplish the
seek. I seem to recall that back in the '90s there was a widely used
practice of sending null bytes instead of issuing an actual seek command,
because it was lower latency on rotational disks. That assumption is
probably not true of SSDs.
This seems a better-fitting explanation for the performance numbers I'm
getting (about half of expected) than the old standby of "slogs are qd=1,
so they are always much slower than manufacturer specs." Samsung actually
quotes qd=1 write performance numbers for these enterprise SSDs, so I
have a fair reference, and my testing against the raw device supports
their numbers. I've gone so far as to turn off cache flushes for these
devices in /kernel/drv/sd.conf for this testing, to make best use of
their 512MiB power-protected cache, and I've deliberately kept test sizes
smaller than the on-drive cache to take advantage of it. Essentially, the
bottleneck is the CPU on the SSD.
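For reference, the sd.conf change I mean is an sd-config-list entry
marking the cache non-volatile, which lets the sd driver skip cache-flush
commands for the device. The vendor/product string below is a placeholder;
it has to match the drive's actual inquiry data (vendor ID padded to 8
characters):

# /kernel/drv/sd.conf -- placeholder VID/PID, adjust to match the drive
sd-config-list = "ATA     Samsung SSD 845", "cache-nonvolatile:true";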
Sincerely,
Andrew Kinney