I'm motivated to make `zfs set refreservation=auto` do the right thing in
the face of raidz and 4k physical blocks, but my data points are
inconsistent.  Experimentation shows raidz2 parity overhead that matches
what I would expect from raidz1.

Let's consider the case of a pool with 8 disks in one raidz2 vdev,
ashift=12.

In the spreadsheet
<https://docs.google.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=930519344>
from
Matt's "How I Learned to Stop Worrying and Love RAIDZ" blog entry
<https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz>,
the "RAIDZ2 parity cost" sheet cells F4 and F5 suggest the parity and
padding cost is 200%.  That is, a 10G zvol with volblocksize=4k or 8k
should end up taking 30G of space either way.
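
To check that I'm reading the model correctly, here is a rough sketch of
how I understand per-block allocation on raidz (my own approximation of
the spreadsheet's arithmetic, not authoritative code, and raidz_asize is
just a name I made up): data sectors, plus nparity parity sectors per
stripe row, with the total rounded up to a multiple of nparity + 1 so no
unusable skip gap is left.  For 8 disks, raidz2, ashift=12 it predicts 3x
for both volblocksize=4k and 8k:

    import math

    def raidz_asize(psize, ashift=12, ndisks=8, nparity=2):
        # My approximation of per-block allocation on a raidz vdev:
        # data sectors, plus nparity parity sectors per stripe row,
        # rounded up to a multiple of (nparity + 1) to avoid skip gaps.
        sector = 1 << ashift
        dcols = ndisks - nparity                     # data columns per row
        dsect = math.ceil(psize / sector)            # data sectors
        psect = nparity * math.ceil(dsect / dcols)   # parity sectors
        total = dsect + psect
        padded = math.ceil(total / (nparity + 1)) * (nparity + 1)
        return padded * sector

    for vbs in (4096, 8192):
        alloc = raidz_asize(vbs)
        print(f"volblocksize={vbs}: allocated={alloc} ({alloc / vbs:.1f}x)")
    # volblocksize=4096: allocated=12288 (3.0x)
    # volblocksize=8192: allocated=24576 (3.0x)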

Experimentation tells me that each uses just a little more than double the
amount calculated by refreservation=auto.  In both cases compression=off,
and I've overwritten the entire zvol with `dd if=/dev/zero ...`:

$ zfs get used,referenced,logicalused,logicalreferenced,volblocksize,refreservation zones/mg/disk0
NAME            PROPERTY           VALUE      SOURCE
zones/mg/disk0  used               21.4G      -
zones/mg/disk0  referenced         21.4G      -
zones/mg/disk0  logicalused        10.0G      -
zones/mg/disk0  logicalreferenced  10.0G      -
zones/mg/disk0  volblocksize       8K         default
zones/mg/disk0  refreservation     10.3G      local
$ zfs get used,referenced,logicalused,logicalreferenced,volblocksize,refreservation zones/mg/disk1
NAME            PROPERTY           VALUE      SOURCE
zones/mg/disk1  used               21.4G      -
zones/mg/disk1  referenced         21.4G      -
zones/mg/disk1  logicalused        10.0G      -
zones/mg/disk1  logicalreferenced  10.0G      -
zones/mg/disk1  volblocksize       4K         -
zones/mg/disk1  refreservation     10.6G      local
$ zpool status zones
  pool: zones
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        zones                      ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c0t55CD2E404C314E1Ed0  ONLINE       0     0     0
            c0t55CD2E404C314E85d0  ONLINE       0     0     0
            c0t55CD2E404C315450d0  ONLINE       0     0     0
            c0t55CD2E404C31554Ad0  ONLINE       0     0     0
            c0t55CD2E404C315BB6d0  ONLINE       0     0     0
            c0t55CD2E404C315BCDd0  ONLINE       0     0     0
            c0t55CD2E404C315BFDd0  ONLINE       0     0     0
            c0t55CD2E404C317724d0  ONLINE       0     0     0
# echo ::spa -c | mdb -k | grep ashift | sort -u
            ashift=000000000000000c

Overwriting from /dev/urandom didn't change the above numbers in any
significant way.
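
Doing the arithmetic on the numbers above: used / logicalused =
21.4G / 10.0G ≈ 2.14x for each zvol, i.e. roughly 114% overhead, where the
spreadsheet (and my sketch above) predicts 3x, i.e. 200%.  (The mdb output
confirms ashift = 0xc = 12, i.e. 4k sectors.)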

My understanding is that each volblocksize block has its data and parity
spread across a minimum of 3 devices, so that any two devices could be
lost and the block still recovered.  Considering the simple case of
volblocksize=4k and ashift=12, 200% overhead for parity (and no padding)
seems spot-on.  Yet I seem to be seeing only 100% overhead for parity,
plus a little for metadata and its parity.
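
(By the same arithmetic as the sketch above, raidz1 at this width would
allocate one data sector plus one parity sector per 4k block, 8k in total,
i.e. 100% overhead, which is much closer to what I'm actually seeing.)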

What fundamental concept am I missing?

TIA,
Mike
