Re: [developer] raidz overhead with ashift=12

2019-06-07 Thread Matthew Ahrens
On Fri, Jun 7, 2019 at 11:06 AM Eric Borisch  wrote:

> On Fri, Jun 7, 2019 at 12:03 PM Matthew Ahrens wrote:
>
>> The spreadsheet shows how much space will be allocated, which is
>> reflected in the zpool `allocated` property.  However, you are looking at
>> the zfs `used` and `referenced` properties.  These properties (as well as
>> `available` and all other zfs (not zpool) accounting values) take into
>> account the expected RAIDZ overhead, which is calculated assuming 128K
>> logical size blocks.  This means that zfs accounting hides the parity (and
>> padding) overhead when the block size is around 128K.  Other block sizes
>> may see (typically only slightly) more or less space consumed than expected
>> (e.g. if the `recordsize` property has been changed, a 1GB file may have
>> zfs `used` of 0.9G, or 1.1G).
>>
>> As indicated in cell F23, the expected overhead for 4K-sector 8-wide
>> RAIDZ2 is 41% (which is around what the RAID6 overhead would be, 2/6 =
>> 33%).  This is taken into account in the "RAID-Z deflation ratio"
>> (`vdev_deflate_ratio`).  In other words, `used = allocated / 1.41`.  If we
>> undo that, we get `21.4G * 1.41 = 30.2G`, which is around what we expected.
>>
>
> Aha! I've often wondered why I couldn't quite get some values to line up
> with what I understood to be occurring on disk. Looks like a potential
> area for improvement;
>

 I agree this is confusing and it's an area we should try to improve!


> for ZVOLs, wouldn't this calculation be better served by considering
> the volblocksize (and associated overhead) of each volume? The 'typically
> only slightly' changes to 'wildly differs' with RAIDZ2/3 and small
> volblocksizes.
>

I get the idea there, but it isn't very straightforward, because the
deflate ratio is specific to the RAIDZ vdev, which can store different
zvols (with different volblocksizes) and filesystems.
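
To put numbers on that mismatch, here is a rough Python sketch (an
illustration only, not ZFS code), assuming ashift=12 and the 8-wide RAIDZ2
layout from this thread.  It compares each volblocksize's real allocation
cost with what `used` reports once the single 128K-based deflate ratio is
applied:

SECTOR, WIDTH, NPARITY = 4096, 8, 2           # ashift=12, 8-wide RAIDZ2

def alloc(psize):
    # data sectors, plus NPARITY parity sectors per (WIDTH - NPARITY) data
    # sectors, with the total padded up to a multiple of NPARITY + 1
    data = -(-psize // SECTOR)
    total = data + NPARITY * -(-data // (WIDTH - NPARITY))
    return (total + -total % (NPARITY + 1)) * SECTOR

deflate = alloc(128 * 1024) / (128 * 1024)    # ~1.41 for this layout

for vbs in (4 * 1024, 8 * 1024, 16 * 1024, 128 * 1024):
    raw = alloc(vbs) / vbs                    # allocated : logical
    print(f"{vbs // 1024:>3}K: allocates {raw:.2f}x,"
          f" `used` reports {raw / deflate:.2f}x")

With those assumptions, a 4K or 8K volblocksize block allocates 3x its
logical size but shows up as only ~2.1x in `used` (matching the
21.4G-for-10G zvols quoted above), while a 128K block shows ~1.0x.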

Having a single, vdev-wide ratio is intentional - if a zvol that's "using"
1TB had actually allocated more space than a filesystem that's also "using"
1TB, that would be even more confusing than the current situation.

If we took into account the additional overhead when calculating each
dataset's "available", then the points of confusion would be:
- different datasets would have different amounts "available" even if they
don't have quotas/reservations
- a zvol could have 1TB "available", but after writing 1TB it would be using
an additional 1.5TB.  So the space "used" and "available" would actually be
in different units, and adding up used+available wouldn't make any sense.
Those are less confusing than changing "used" would be, but probably still
not worth it, in my opinion.

So I think we might need to do some more brainstorming to come up with
something that is a net improvement on the current situation.

--matt



Re: [developer] raidz overhead with ashift=12

2019-06-07 Thread Richard Elling


> On Jun 7, 2019, at 12:15 PM, Mike Gerdts  wrote:
> 
> On Fri, Jun 7, 2019 at 12:03 PM Matthew Ahrens wrote:
> On Thu, Jun 6, 2019 at 10:56 PM Mike Gerdts wrote:
> I'm motivated to make zfs set refreservation=auto do the right thing in the 
> face of raidz and 4k physical blocks, but have data points that provide 
> inconsistent data.  Experimentation shows raidz2 parity overhead that matches 
> my expectations for raidz1.
> 
> Let's consider the case of a pool with 8 disks in one raidz2 vdev, ashift=12.
> 
> In the spreadsheet from Matt's How I Learned to Stop Worrying and Love
> RAIDZ blog entry, the "RAIDZ2 parity cost" sheet cells F4 and F5 suggest the
> parity and padding cost is 200%.  That is, a 10 gig zvol with volblocksize=4k 
> or 8k should both end up taking up 30 gig of space.
> 
> That makes sense to me as well.
>  
> 
> Experimentation tells me that they each use just a little bit more than 
> double the amount that was calculated by refreservation=auto.  In each of 
> these cases, compression=off and I've overwritten them with `dd if=/dev/zero 
> ...`
> 
> $ zfs get 
> used,referenced,logicalused,logicalreferenced,volblocksize,refreservation 
> zones/mg/disk0
> NAMEPROPERTY   VALUE  SOURCE
> zones/mg/disk0  used   21.4G  -
> zones/mg/disk0  referenced 21.4G  -
> zones/mg/disk0  logicalused10.0G  -
> zones/mg/disk0  logicalreferenced  10.0G  -
> zones/mg/disk0  volblocksize   8K default
> zones/mg/disk0  refreservation 10.3G  local
> $ zfs get 
> used,referenced,logicalused,logicalreferenced,volblocksize,refreservation 
> zones/mg/disk1
> NAMEPROPERTY   VALUE  SOURCE
> zones/mg/disk1  used   21.4G  -
> zones/mg/disk1  referenced 21.4G  -
> zones/mg/disk1  logicalused10.0G  -
> zones/mg/disk1  logicalreferenced  10.0G  -
> zones/mg/disk1  volblocksize   4K -
> zones/mg/disk1  refreservation 10.6G  local
> $ zpool status zones
>   pool: zones
>  state: ONLINE
>   scan: none requested
> config:
> 
> NAME   STATE READ WRITE CKSUM
> zones  ONLINE   0 0 0
>   raidz2-0 ONLINE   0 0 0
> c0t55CD2E404C314E1Ed0  ONLINE   0 0 0
> c0t55CD2E404C314E85d0  ONLINE   0 0 0
> c0t55CD2E404C315450d0  ONLINE   0 0 0
> c0t55CD2E404C31554Ad0  ONLINE   0 0 0
> c0t55CD2E404C315BB6d0  ONLINE   0 0 0
> c0t55CD2E404C315BCDd0  ONLINE   0 0 0
> c0t55CD2E404C315BFDd0  ONLINE   0 0 0
> c0t55CD2E404C317724d0  ONLINE   0 0 0
> # echo ::spa -c | mdb -k | grep ashift | sort -u
> ashift=000c
> 
> Overwriting from /dev/urandom didn't change the above numbers in any 
> significant way.
> 
> My understanding is that each volblocksize block has data and parity spread 
> across a minimum of 3 devices so that any two could be lost and still 
> recover.  Considering the simple case of volblocksize=4k and ashift=12, 200% 
> overhead for parity (+ no pad) seems spot-on. 
> 
> That's right.  And in the case of volblocksize=8K, you have 2 data + 2 parity 
> + 2 pad = 6 sectors = 24K allocated.
>  
> I seem to be only seeing 100% overhead for parity plus a little for metadata 
> and its parity.
> 
> What fundamental concept am I missing?
> 
> The spreadsheet shows how much space will be allocated, which is reflected in 
> the zpool `allocated` property.  However, you are looking at the zfs `used` 
> and `referenced` properties.  These properties (as well as `available` and 
> all other zfs (not zpool) accounting values) take into account the expected 
> RAIDZ overhead, which is calculated assuming 128K logical size blocks.  This 
> means that zfs accounting hides the parity (and padding) overhead when the 
> block size is around 128K.  Other block sizes may see (typically only 
> slightly) more or less space consumed than expected (e.g. if the `recordsize` 
> property has been changed, a 1GB file may have zfs `used` of 0.9G, or 1.1G).
> 
> As indicated in cell F23, the expected overhead for 4K-sector 8-wide RAIDZ2 
> is 41% (which is around what the RAID6 overhead would be, 2/6 = 33%).  This
> is taken into account in the "RAID-Z deflation ratio" (`vdev_deflate_ratio`). 
>  In other words, `used = allocated / 1.41`.  If we undo that, we get `21.4G * 
> 1.41 = 30.2G`, which is around what we expected.
> 
> Thanks for that - it should give me 

Re: [developer] raidz overhead with ashift=12

2019-06-07 Thread Mike Gerdts
On Fri, Jun 7, 2019 at 12:03 PM Matthew Ahrens  wrote:

> On Thu, Jun 6, 2019 at 10:56 PM Mike Gerdts wrote:
>
>> I'm motivated to make zfs set refreservation=auto do the right thing in
>> the face of raidz and 4k physical blocks, but have data points that provide
>> inconsistent data.  Experimentation shows raidz2 parity overhead that
>> matches my expectations for raidz1.
>>
>> Let's consider the case of a pool with 8 disks in one raidz2 vdev,
>> ashift=12.
>>
>> In the spreadsheet from Matt's How I Learned to Stop Worrying and Love
>> RAIDZ blog entry, the "RAIDZ2 parity cost" sheet cells F4 and F5 suggest the parity
>> and padding cost is 200%.  That is, a 10 gig zvol with volblocksize=4k or
>> 8k should both end up taking up 30 gig of space.
>>
>
> That makes sense to me as well.
>
>
>>
>> Experimentation tells me that they each use just a little bit more than
>> double the amount that was calculated by refreservation=auto.  In each of
>> these cases, compression=off and I've overwritten them with `dd
>> if=/dev/zero ...`
>>
>> $ zfs get
>> used,referenced,logicalused,logicalreferenced,volblocksize,refreservation
>> zones/mg/disk0
>> NAMEPROPERTY   VALUE  SOURCE
>> zones/mg/disk0  used   21.4G  -
>> zones/mg/disk0  referenced 21.4G  -
>> zones/mg/disk0  logicalused10.0G  -
>> zones/mg/disk0  logicalreferenced  10.0G  -
>> zones/mg/disk0  volblocksize   8K default
>> zones/mg/disk0  refreservation 10.3G  local
>> $ zfs get
>> used,referenced,logicalused,logicalreferenced,volblocksize,refreservation
>> zones/mg/disk1
>> NAMEPROPERTY   VALUE  SOURCE
>> zones/mg/disk1  used   21.4G  -
>> zones/mg/disk1  referenced 21.4G  -
>> zones/mg/disk1  logicalused10.0G  -
>> zones/mg/disk1  logicalreferenced  10.0G  -
>> zones/mg/disk1  volblocksize   4K -
>> zones/mg/disk1  refreservation 10.6G  local
>> $ zpool status zones
>>   pool: zones
>>  state: ONLINE
>>   scan: none requested
>> config:
>>
>> NAME   STATE READ WRITE CKSUM
>> zones  ONLINE   0 0 0
>>   raidz2-0 ONLINE   0 0 0
>> c0t55CD2E404C314E1Ed0  ONLINE   0 0 0
>> c0t55CD2E404C314E85d0  ONLINE   0 0 0
>> c0t55CD2E404C315450d0  ONLINE   0 0 0
>> c0t55CD2E404C31554Ad0  ONLINE   0 0 0
>> c0t55CD2E404C315BB6d0  ONLINE   0 0 0
>> c0t55CD2E404C315BCDd0  ONLINE   0 0 0
>> c0t55CD2E404C315BFDd0  ONLINE   0 0 0
>> c0t55CD2E404C317724d0  ONLINE   0 0 0
>> # echo ::spa -c | mdb -k | grep ashift | sort -u
>> ashift=000c
>>
>> Overwriting from /dev/urandom didn't change the above numbers in any
>> significant way.
>>
>> My understanding is that each volblocksize block has data and parity
>> spread across a minimum of 3 devices so that any two could be lost and
>> still recover.  Considering the simple case of volblocksize=4k and
>> ashift=12, 200% overhead for parity (+ no pad) seems spot-on.
>>
>
> That's right.  And in the case of volblocksize=8K, you have 2 data + 2
> parity + 2 pad = 6 sectors = 24K allocated.
>
>
>> I seem to be only seeing 100% overhead for parity plus a little for
>> metadata and its parity.
>>
>> What fundamental concept am I missing?
>>
>
> The spreadsheet shows how much space will be allocated, which is reflected
> in the zpool `allocated` property.  However, you are looking at the zfs
> `used` and `referenced` properties.  These properties (as well as
> `available` and all other zfs (not zpool) accounting values) take into
> account the expected RAIDZ overhead, which is calculated assuming 128K
> logical size blocks.  This means that zfs accounting hides the parity (and
> padding) overhead when the block size is around 128K.  Other block sizes
> may see (typically only slightly) more or less space consumed than expected
> (e.g. if the `recordsize` property has been changed, a 1GB file may have
> zfs `used` of 0.9G, or 1.1G).
>
> As indicated in cell F23, the expected overhead for 4K-sector 8-wide
> RAIDZ2 is 41% (which is around what the RAID6 overhead would be, 2/6 =
> 33%).  This is taken into account in the "RAID-Z deflation ratio"
> (`vdev_deflate_ratio`).  In other words, `used = allocated / 1.41`.  If we
> undo that, we get `21.4G * 1.41 = 30.2G`, which is around what we expected.
>

Thanks for that - it should give me enough of a clue that I can coax
zvol_volsize_to_reservation() to give a 

Re: [developer] raidz overhead with ashift=12

2019-06-07 Thread Eric Borisch
On Fri, Jun 7, 2019 at 12:03 PM Matthew Ahrens  wrote:

> The spreadsheet shows how much space will be allocated, which is reflected
> in the zpool `allocated` property.  However, you are looking at the zfs
> `used` and `referenced` properties.  These properties (as well as
> `available` and all other zfs (not zpool) accounting values) take into
> account the expected RAIDZ overhead, which is calculated assuming 128K
> logical size blocks.  This means that zfs accounting hides the parity (and
> padding) overhead when the block size is around 128K.  Other block sizes
> may see (typically only slightly) more or less space consumed than expected
> (e.g. if the `recordsize` property has been changed, a 1GB file may have
> zfs `used` of 0.9G, or 1.1G).
>
> As indicated in cell F23, the expected overhead for 4K-sector 8-wide
> RAIDZ2 is 41% (which is around what the RAID6 overhead would be, 2/6 =
> 33%).  This is taken into account in the "RAID-Z deflation ratio"
> (`vdev_deflate_ratio`).  In other words, `used = allocated / 1.41`.  If we
> undo that, we get `21.4G * 1.41 = 30.2G`, which is around what we expected.
>

Aha! I've often wondered why I couldn't quite get some values to line up
with what I understood to be occurring on disk. Looks like a potential area
for improvement; for ZVOLs, wouldn't this calculation be better served by
considering the volblocksize (and associated overhead) of each volume? The
'typically only slightly' changes to 'wildly differs' with RAIDZ2/3 and
small volblocksizes.

Thanks,
  - Eric



Re: [developer] raidz overhead with ashift=12

2019-06-07 Thread Matthew Ahrens
On Thu, Jun 6, 2019 at 10:56 PM Mike Gerdts  wrote:

> I'm motivated to make zfs set refreservation=auto do the right thing in
> the face of raidz and 4k physical blocks, but have data points that provide
> inconsistent data.  Experimentation shows raidz2 parity overhead that
> matches my expectations for raidz1.
>
> Let's consider the case of a pool with 8 disks in one raidz2 vdev,
> ashift=12.
>
> In the spreadsheet from Matt's How I Learned to Stop Worrying and Love
> RAIDZ blog entry, the "RAIDZ2 parity cost" sheet cells F4 and F5 suggest the parity
> and padding cost is 200%.  That is, a 10 gig zvol with volblocksize=4k or
> 8k should both end up taking up 30 gig of space.
>

That makes sense to me as well.


>
> Experimentation tells me that they each use just a little bit more than
> double the amount that was calculated by refreservation=auto.  In each of
> these cases, compression=off and I've overwritten them with `dd
> if=/dev/zero ...`
>
> $ zfs get
> used,referenced,logicalused,logicalreferenced,volblocksize,refreservation
> zones/mg/disk0
> NAMEPROPERTY   VALUE  SOURCE
> zones/mg/disk0  used   21.4G  -
> zones/mg/disk0  referenced 21.4G  -
> zones/mg/disk0  logicalused10.0G  -
> zones/mg/disk0  logicalreferenced  10.0G  -
> zones/mg/disk0  volblocksize   8K default
> zones/mg/disk0  refreservation 10.3G  local
> $ zfs get
> used,referenced,logicalused,logicalreferenced,volblocksize,refreservation
> zones/mg/disk1
> NAMEPROPERTY   VALUE  SOURCE
> zones/mg/disk1  used   21.4G  -
> zones/mg/disk1  referenced 21.4G  -
> zones/mg/disk1  logicalused10.0G  -
> zones/mg/disk1  logicalreferenced  10.0G  -
> zones/mg/disk1  volblocksize   4K -
> zones/mg/disk1  refreservation 10.6G  local
> $ zpool status zones
>   pool: zones
>  state: ONLINE
>   scan: none requested
> config:
>
> NAME   STATE READ WRITE CKSUM
> zones  ONLINE   0 0 0
>   raidz2-0 ONLINE   0 0 0
> c0t55CD2E404C314E1Ed0  ONLINE   0 0 0
> c0t55CD2E404C314E85d0  ONLINE   0 0 0
> c0t55CD2E404C315450d0  ONLINE   0 0 0
> c0t55CD2E404C31554Ad0  ONLINE   0 0 0
> c0t55CD2E404C315BB6d0  ONLINE   0 0 0
> c0t55CD2E404C315BCDd0  ONLINE   0 0 0
> c0t55CD2E404C315BFDd0  ONLINE   0 0 0
> c0t55CD2E404C317724d0  ONLINE   0 0 0
> # echo ::spa -c | mdb -k | grep ashift | sort -u
> ashift=000c
>
> Overwriting from /dev/urandom didn't change the above numbers in any
> significant way.
>
> My understanding is that each volblocksize block has data and parity
> spread across a minimum of 3 devices so that any two could be lost and
> still recover.  Considering the simple case of volblocksize=4k and
> ashift=12, 200% overhead for parity (+ no pad) seems spot-on.
>

That's right.  And in the case of volblocksize=8K, you have 2 data + 2
parity + 2 pad = 6 sectors = 24K allocated.
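
For reference, here is a small Python sketch of that allocation rule
(patterned loosely after the logic of `vdev_raidz_asize()`; ashift=12 and
the 8-wide RAIDZ2 layout from this pool are assumed, metadata is ignored):

SECTOR = 4096           # ashift=12
WIDTH, NPARITY = 8, 2   # 8 disks, RAIDZ2

def raidz_sectors(psize):
    # round the block up to whole sectors, add NPARITY parity sectors per
    # stripe of (WIDTH - NPARITY) data sectors, then pad the total to a
    # multiple of NPARITY + 1 so freed segments stay large enough to reuse
    data = -(-psize // SECTOR)
    parity = NPARITY * -(-data // (WIDTH - NPARITY))
    pad = -(data + parity) % (NPARITY + 1)
    return data, parity, pad

for psize in (4 * 1024, 8 * 1024, 128 * 1024):
    d, p, s = raidz_sectors(psize)
    print(f"{psize // 1024:>3}K block: {d} data + {p} parity + {s} pad"
          f" = {d + p + s} sectors = {(d + p + s) * SECTOR // 1024}K allocated")

That reproduces the numbers above: 12K allocated for a 4K block (the 200%
case) and 24K for an 8K block.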


> I seem to be only seeing 100% overhead for parity plus a little for
> metadata and its parity.
>
> What fundamental concept am I missing?
>

The spreadsheet shows how much space will be allocated, which is reflected
in the zpool `allocated` property.  However, you are looking at the zfs
`used` and `referenced` properties.  These properties (as well as
`available` and all other zfs (not zpool) accounting values) take into
account the expected RAIDZ overhead, which is calculated assuming 128K
logical size blocks.  This means that zfs accounting hides the parity (and
padding) overhead when the block size is around 128K.  Other block sizes
may see (typically only slightly) more or less space consumed than expected
(e.g. if the `recordsize` property has been changed, a 1GB file may have
zfs `used` of 0.9G, or 1.1G).

As indicated in cell F23, the expected overhead for 4K-sector 8-wide RAIDZ2
is 41% (which is around what the RAID6 overhead would be, 2/6 = 33%).  This
is taken into account in the "RAID-Z deflation ratio"
(`vdev_deflate_ratio`).  In other words, `used = allocated / 1.41`.  If we
undo that, we get `21.4G * 1.41 = 30.2G`, which is around what we expected.
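
(To spell out where the 1.41 comes from, using the same allocation rule
sketched above: a 128K block is 32 data sectors, which needs
`2 * ceil(32 / 6) = 12` parity sectors plus 1 pad sector to reach a multiple
of 3, so `45 * 4K = 180K` is allocated per 128K of data, and
`180K / 128K = 1.41`.)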

--matt


> TIA,
> Mike

Re: [developer] raidz overhead with ashift=12

2019-06-07 Thread Richard Elling
> On Jun 6, 2019, at 10:54 PM, Mike Gerdts  wrote:
> 
> I'm motivated to make zfs set refreservation=auto do the right thing in the 
> face of raidz and 4k physical blocks, but have data points that provide 
> inconsistent data.  Experimentation shows raidz2 parity overhead that matches 
> my expectations for raidz1.
> 
> Let's consider the case of a pool with 8 disks in one raidz2 vdev, ashift=12.
> 
> In the spreadsheet from Matt's How I Learned to Stop Worrying and Love
> RAIDZ blog entry, the "RAIDZ2 parity cost" sheet cells F4 and F5 suggest the
> parity and padding cost is 200%.  That is, a 10 gig zvol with volblocksize=4k 
> or 8k should both end up taking up 30 gig of space.
> 
> Experimentation tells me that they each use just a little bit more than 
> double the amount that was calculated by refreservation=auto.  In each of 
> these cases, compression=off and I've overwritten them with `dd if=/dev/zero 
> ...`

IIRC, the skip blocks are accounted in the pool's "alloc", but not in the
dataset's "used".
 -- richard

> 
> $ zfs get 
> used,referenced,logicalused,logicalreferenced,volblocksize,refreservation 
> zones/mg/disk0
> NAMEPROPERTY   VALUE  SOURCE
> zones/mg/disk0  used   21.4G  -
> zones/mg/disk0  referenced 21.4G  -
> zones/mg/disk0  logicalused10.0G  -
> zones/mg/disk0  logicalreferenced  10.0G  -
> zones/mg/disk0  volblocksize   8K default
> zones/mg/disk0  refreservation 10.3G  local
> $ zfs get 
> used,referenced,logicalused,logicalreferenced,volblocksize,refreservation 
> zones/mg/disk1
> NAMEPROPERTY   VALUE  SOURCE
> zones/mg/disk1  used   21.4G  -
> zones/mg/disk1  referenced 21.4G  -
> zones/mg/disk1  logicalused10.0G  -
> zones/mg/disk1  logicalreferenced  10.0G  -
> zones/mg/disk1  volblocksize   4K -
> zones/mg/disk1  refreservation 10.6G  local
> $ zpool status zones
>   pool: zones
>  state: ONLINE
>   scan: none requested
> config:
> 
> NAME   STATE READ WRITE CKSUM
> zones  ONLINE   0 0 0
>   raidz2-0 ONLINE   0 0 0
> c0t55CD2E404C314E1Ed0  ONLINE   0 0 0
> c0t55CD2E404C314E85d0  ONLINE   0 0 0
> c0t55CD2E404C315450d0  ONLINE   0 0 0
> c0t55CD2E404C31554Ad0  ONLINE   0 0 0
> c0t55CD2E404C315BB6d0  ONLINE   0 0 0
> c0t55CD2E404C315BCDd0  ONLINE   0 0 0
> c0t55CD2E404C315BFDd0  ONLINE   0 0 0
> c0t55CD2E404C317724d0  ONLINE   0 0 0
> # echo ::spa -c | mdb -k | grep ashift | sort -u
> ashift=000c
> 
> Overwriting from /dev/urandom didn't change the above numbers in any 
> significant way.
> 
> My understanding is that each volblocksize block has data and parity spread 
> across a minimum of 3 devices so that any two could be lost and still 
> recover.  Considering the simple case of volblocksize=4k and ashift=12, 200% 
> overhead for parity (+ no pad) seems spot-on.  I seem to be only seeing 100% 
> overhead for parity plus a little for metadata and its parity.
> 
> What fundamental concept am I missing?
> 
> TIA,
> Mike
