I'm motivated to make zfs set refreservation=auto do the right thing in the face of raidz and 4k physical blocks, but the data points I have are inconsistent with one another. Experimentation shows raidz2 parity overhead that matches what I would expect from raidz1.
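To spell out the arithmetic I'm assuming (my own sketch of the expectation, not something I've confirmed against the code): with ashift=12 every allocation happens in 4K sectors, so a 4K logical block is a single data sector, and the parity cost per block is simply the number of parity sectors added to it:

    raidz1: 1 data sector + 1 parity sector  =  8K allocated -> 100% overhead
    raidz2: 1 data sector + 2 parity sectors = 12K allocated -> 200% overhead

What I measure below looks much closer to the raidz1 line than to the raidz2 line.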
Let's consider the case of a pool with 8 disks in one raidz2 vdev, ashift=12. In the spreadsheet <https://docs.google.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=930519344> from Matt's How I Learned to Stop Worrying and Love RAIDZ <https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz> blog entry, cells F4 and F5 of the "RAIDZ2 parity cost" sheet suggest the parity and padding cost is 200%. That is, a 10 gig zvol should end up taking about 30 gig of space whether volblocksize is 4k or 8k. Experimentation tells me that each one instead uses just a little more than double the amount calculated by refreservation=auto. In each case compression=off, and I've overwritten the volumes with `dd if=/dev/zero ...`:

$ zfs get used,referenced,logicalused,logicalreferenced,volblocksize,refreservation zones/mg/disk0
NAME            PROPERTY           VALUE    SOURCE
zones/mg/disk0  used               21.4G    -
zones/mg/disk0  referenced         21.4G    -
zones/mg/disk0  logicalused        10.0G    -
zones/mg/disk0  logicalreferenced  10.0G    -
zones/mg/disk0  volblocksize       8K       default
zones/mg/disk0  refreservation     10.3G    local

$ zfs get used,referenced,logicalused,logicalreferenced,volblocksize,refreservation zones/mg/disk1
NAME            PROPERTY           VALUE    SOURCE
zones/mg/disk1  used               21.4G    -
zones/mg/disk1  referenced         21.4G    -
zones/mg/disk1  logicalused        10.0G    -
zones/mg/disk1  logicalreferenced  10.0G    -
zones/mg/disk1  volblocksize       4K       -
zones/mg/disk1  refreservation     10.6G    local

$ zpool status zones
  pool: zones
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        zones                      ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c0t55CD2E404C314E1Ed0  ONLINE       0     0     0
            c0t55CD2E404C314E85d0  ONLINE       0     0     0
            c0t55CD2E404C315450d0  ONLINE       0     0     0
            c0t55CD2E404C31554Ad0  ONLINE       0     0     0
            c0t55CD2E404C315BB6d0  ONLINE       0     0     0
            c0t55CD2E404C315BCDd0  ONLINE       0     0     0
            c0t55CD2E404C315BFDd0  ONLINE       0     0     0
            c0t55CD2E404C317724d0  ONLINE       0     0     0

# echo ::spa -c | mdb -k | grep ashift | sort -u
ashift=000000000000000c

Overwriting from /dev/urandom didn't change the above numbers in any significant way.

My understanding is that each volblocksize block has its data and parity spread across at least 3 devices, so that any two can be lost and the block can still be reconstructed. In the simple case of volblocksize=4k and ashift=12, 200% overhead for parity (and no padding) seems spot-on (the arithmetic I'm working from is in the P.S. below). Yet I seem to be seeing only about 100% overhead for parity, plus a little for metadata and its parity.

What fundamental concept am I missing?

TIA,
Mike
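P.S. Here is the sector-level arithmetic I've been working from, written out as a small standalone C program. The raidz_asize() function below is my own sketch of what I believe vdev_raidz_asize() in vdev_raidz.c does (data sectors, plus parity per stripe row, rounded up to a multiple of nparity + 1 sectors); it is not a copy of the actual code, and the parameters are just this pool's configuration (8 disks, double parity, ashift=12).

#include <stdio.h>
#include <stdint.h>

/*
 * My sketch of raidz allocation for a single block: count the data
 * sectors, add parity sectors for each row of data sectors, then round
 * the total up to a multiple of (nparity + 1) sectors (the padding that
 * avoids leaving unusably small free segments).
 */
static uint64_t
raidz_asize(uint64_t psize, uint64_t ashift, uint64_t ndisks, uint64_t nparity)
{
        uint64_t ndata = ndisks - nparity;
        uint64_t asize = ((psize - 1) >> ashift) + 1;                 /* data sectors */
        asize += nparity * ((asize + ndata - 1) / ndata);             /* parity sectors */
        asize = ((asize + nparity) / (nparity + 1)) * (nparity + 1);  /* pad */
        return (asize << ashift);
}

int
main(void)
{
        uint64_t blocks[] = { 4096, 8192 };     /* volblocksize=4k and 8k */

        for (int i = 0; i < 2; i++) {
                uint64_t psize = blocks[i];
                uint64_t asize = raidz_asize(psize, 12, 8, 2);
                printf("psize=%llu asize=%llu overhead=%llu%%\n",
                    (unsigned long long)psize, (unsigned long long)asize,
                    (unsigned long long)((asize - psize) * 100 / psize));
        }
        return (0);
}

For both block sizes this prints 200% overhead (4k -> 12k allocated, 8k -> 24k allocated), i.e. 3x the logical size, which is why the ~2.14x I actually measure above (21.4G used for 10.0G logicalused) has me confused.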
------------------------------------------
openzfs: openzfs-developer
Permalink: https://openzfs.topicbox.com/groups/developer/Tf89af487ee658da3-M69692d66e50c86b1c23d0e6d
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription