> On Jun 6, 2019, at 10:54 PM, Mike Gerdts <[email protected]> wrote:
>
> I'm motivated to make zfs set refreservation=auto do the right thing in the
> face of raidz and 4k physical blocks, but the data points I have are
> inconsistent with each other. Experimentation shows raidz2 parity overhead
> that matches my expectations for raidz1.
>
> Let's consider the case of a pool with 8 disks in one raidz2 vdev, ashift=12.
>
> In the spreadsheet
> <https://docs.google.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=930519344>
> from Matt's How I Learned to Stop Worrying and Love RAIDZ
> <https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz>
> blog entry, the "RAIDZ2 parity cost" sheet cells F4 and F5 suggest the
> parity and padding cost is 200%. That is, a 10 gig zvol with volblocksize=4k
> or 8k should end up taking up 30 gig of space either way.
>
> Experimentation tells me that each uses just a little more than double the
> amount that was calculated by refreservation=auto. In each of these cases,
> compression=off and I've overwritten them with `dd if=/dev/zero ...`
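For reference, the 200% expectation above can be reproduced with a quick
back-of-envelope sketch. It assumes the layout described in this thread
(8-disk raidz2 with ashift=12, i.e. 4K sectors, at most 6 data sectors plus
2 parity sectors per stripe, allocations padded to a multiple of
nparity+1 = 3 sectors) and only mirrors the round-up rule from the blog
post's spreadsheet, not the exact in-kernel allocation path:

  $ for vbs in 4096 8192; do
      data=$(( vbs / 4096 ))              # 4K data sectors (ashift=12)
      stripes=$(( (data + 5) / 6 ))       # each stripe holds up to 6 data sectors
      total=$(( data + 2 * stripes ))     # plus 2 parity sectors per stripe
      padded=$(( (total + 2) / 3 * 3 ))   # pad to a multiple of nparity+1 = 3
      echo "volblocksize=$vbs allocated=$(( padded * 4096 )) overhead=$(( (padded - data) * 100 / data ))%"
    done
  volblocksize=4096 allocated=12288 overhead=200%
  volblocksize=8192 allocated=24576 overhead=200%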
IIRC, the skip blocks are accounted for in the pool's "alloc", but not in the
dataset's "used".
 -- richard

> $ zfs get used,referenced,logicalused,logicalreferenced,volblocksize,refreservation zones/mg/disk0
> NAME            PROPERTY           VALUE  SOURCE
> zones/mg/disk0  used               21.4G  -
> zones/mg/disk0  referenced         21.4G  -
> zones/mg/disk0  logicalused        10.0G  -
> zones/mg/disk0  logicalreferenced  10.0G  -
> zones/mg/disk0  volblocksize       8K     default
> zones/mg/disk0  refreservation     10.3G  local
> $ zfs get used,referenced,logicalused,logicalreferenced,volblocksize,refreservation zones/mg/disk1
> NAME            PROPERTY           VALUE  SOURCE
> zones/mg/disk1  used               21.4G  -
> zones/mg/disk1  referenced         21.4G  -
> zones/mg/disk1  logicalused        10.0G  -
> zones/mg/disk1  logicalreferenced  10.0G  -
> zones/mg/disk1  volblocksize       4K     -
> zones/mg/disk1  refreservation     10.6G  local
> $ zpool status zones
>   pool: zones
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME                       STATE     READ WRITE CKSUM
>         zones                      ONLINE       0     0     0
>           raidz2-0                 ONLINE       0     0     0
>             c0t55CD2E404C314E1Ed0  ONLINE       0     0     0
>             c0t55CD2E404C314E85d0  ONLINE       0     0     0
>             c0t55CD2E404C315450d0  ONLINE       0     0     0
>             c0t55CD2E404C31554Ad0  ONLINE       0     0     0
>             c0t55CD2E404C315BB6d0  ONLINE       0     0     0
>             c0t55CD2E404C315BCDd0  ONLINE       0     0     0
>             c0t55CD2E404C315BFDd0  ONLINE       0     0     0
>             c0t55CD2E404C317724d0  ONLINE       0     0     0
>
> # echo ::spa -c | mdb -k | grep ashift | sort -u
>     ashift=000000000000000c
>
> Overwriting from /dev/urandom didn't change the above numbers in any
> significant way.
>
> My understanding is that each volblocksize block has its data and parity
> spread across a minimum of 3 devices, so that any two could be lost and the
> data still recovered. Considering the simple case of volblocksize=4k and
> ashift=12, 200% overhead for parity (+ no pad) seems spot-on. I seem to be
> seeing only 100% overhead for parity, plus a little for metadata and its
> parity.
>
> What fundamental concept am I missing?
>
> TIA,
> Mike
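One way to see the accounting difference Richard describes would be to compare
the pool-level and dataset-level numbers directly, e.g. by watching how much
the pool's "allocated" grows when the zvol is overwritten versus how much the
dataset's "used" grows. A minimal sketch, assuming the pool and zvol names
from this thread (no output shown; the exact figures will differ per pool):

  $ zpool list -p -o name,size,allocated zones
  $ zfs get -p used,referenced,logicalused zones/mg/disk1

If skip/pad sectors are charged only at the pool level, the growth in
"allocated" should exceed the growth in the dataset's "used" by roughly the
padding overhead.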
