A little more background...

The choice of a pool with 4x draid2:3d has more to do with a sweet spot for
our use case and budget.

I have twenty or so small NAS appliances, currently each with 5x 3.8TB
enterprise SATA SSDs and an Optane log. These each serve a handful of
datasets via NFSv3 to a pool of ESXi heads over 10GbE.

The NAS servers send in regular incremental ZFS send streams to a backup
server over gigabit WAN links, which my code then replicates to the other
backup servers.
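For illustration, a minimal sketch of one such incremental propagation step (the pool, snapshot and host names are hypothetical, and plain ssh transport is assumed; our actual replication code is more involved):

```python
import subprocess

def build_send_cmd(dataset, from_snap, to_snap):
    """Build an incremental `zfs send` argv for from_snap -> to_snap."""
    return ["zfs", "send", "-i",
            f"{dataset}@{from_snap}", f"{dataset}@{to_snap}"]

def replicate(dataset, from_snap, to_snap, backup_host, backup_dataset):
    """Pipe an incremental stream to a backup server over ssh."""
    send = subprocess.Popen(build_send_cmd(dataset, from_snap, to_snap),
                            stdout=subprocess.PIPE)
    recv = subprocess.run(
        ["ssh", backup_host, "zfs", "recv", "-F", backup_dataset],
        stdin=send.stdout)
    send.stdout.close()
    return send.wait() == 0 and recv.returncode == 0
```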

The backup servers use 3.5" SAS drives in good old Supermicro JBODs (which
came from our old HA NAS appliances), with no SLOG or L2ARC and with
primarycache=metadata.

4x draid2:3d turned out to be a sweet spot for achieving line speed when
propagating multiple zfs send streams over 10GbE.

It also turns out that we probably don't need it to be quite so fast or
safe, so we could reconfigure later to get more space at no extra cost.
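As a rough back-of-the-envelope check on the space trade-off (simplified: it ignores metadata, padding and allocation overhead), the usable data capacity of a draid layout works out as:

```python
def draid_usable_tb(data, parity, children, spares, drive_tb):
    """Approximate usable capacity of a
    draid<parity>:<data>d:<children>c:<spares>s vdev.

    The (children - spares) drives' worth of capacity is striped into
    redundancy groups of (data + parity), so the usable data fraction
    is data / (data + parity).
    """
    return (children - spares) * data / (data + parity) * drive_tb

# e.g. draid2:3d:16c:1s on 12TB drives -> 15 * 3/5 * 12 = 108 TB
# and  draid2:3d:6c:1s  on 14TB drives ->  5 * 3/5 * 14 =  42 TB
```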

We only ever have three backup servers online at any one time, and
resyncing a server from scratch only takes a day or so, so reconfiguring
pools (or just rebuilding to defragment, etc.) is really easy to achieve.

In addition to the bulk backups, the NAS servers each retain some snapshots
for local data recovery without needing to resort to the bulk backups.

We currently make extensive use of Storage vMotion. In the future we may
switch to NFSv4.1 to allow multiple ZFS datasets behind one NFS mount
(managing lots of volumes in ESXi is a pain), at which point the number of
datasets will increase by at least an order of magnitude.

Each NAS server maintains a sqlite3 database with a table holding a
complete dump of its current 'zfs get all' state. Other tables hold 'find .
-ls' for large files, and JSON output from smartctl and lsblk etc. These
are then merged centrally and used to drive snapshot creation, replication,
pruning etc as discrete tasks, as well as for monitoring and compliance
assurance.
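To illustrate the 'zfs get all' table (a hypothetical minimal schema; the real database holds more tables and columns), the dump might be loaded along these lines, using the tab-separated output of 'zfs get -Hp all':

```python
import sqlite3
import subprocess

def load_zfs_get_all(db, text=None):
    """Parse `zfs get -Hp all` output (tab-separated columns: name,
    property, value, source) into a sqlite3 table."""
    if text is None:
        text = subprocess.run(["zfs", "get", "-Hp", "all"],
                              capture_output=True, text=True).stdout
    db.execute("""CREATE TABLE IF NOT EXISTS zfs_props
                  (name TEXT, property TEXT, value TEXT, source TEXT)""")
    rows = [line.split("\t") for line in text.splitlines() if line]
    db.executemany("INSERT INTO zfs_props VALUES (?, ?, ?, ?)", rows)
    db.commit()
```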

As an aside: it would be amazing to have things like 'zfs/zpool get all'
available through, say, a fast VFS, oh and in JSON format. For extra marks,
your API should include the ability to take snapshots of the state and a
means of determining what has changed between snapshots :)
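For what it's worth, the 'what changed between snapshots' half of that wish is easy enough to sketch over two property dumps (a hypothetical helper over in-memory state, not an existing zfs API):

```python
def diff_state(old, new):
    """Compare two {(dataset, property): value} snapshots of zfs state.

    Returns (added, removed, changed), where changed maps a key to its
    (old_value, new_value) pair.
    """
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return added, removed, changed
```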

Phil

On Tue, 6 Dec 2022 at 21:59, Phil Harman <phil.har...@gmail.com> wrote:

> Hi Richard, a very long time indeed!
>
> I was thinking more of the accumulated allocation (not instantaneous IOPS).
>
> The pool was created with both DRAID vdevs from the get-go. We had 32x
> 12TB and 12x 14TB drives to repurpose, and wanted to build two identical
> pools.
>
> Our other two servers have 16x new 18TB. Interestingly, (and more by luck
> than design) the usable space across all four servers is the same.
>
> In this pool, both vdevs use a 3d+2p RAID scheme. But whereas
> draid2:3d:16c:1s-0 has three 3d+2p, plus one spare (16x
> 12TB), draid2:3d:6c:1s-1 has just one 3d+2p, plus one spare (6x 14TB
> drives).
>
> The disappointing thing is the lop-sided allocation,
> such that draid2:3d:6c:1s-1 has 50.6T allocated and 3.96T free (i.e. 92%
> allocated), whereas draid2:3d:16c:1s-0 only has 128T allocated and 63.1T
> free (i.e. about 67% allocated).
>
> Phil
>
> On Tue, 6 Dec 2022 at 13:10, Richard Elling <
> richard.ell...@richardelling.com> wrote:
>
>> Hi Phil, long time, no see. Comments embedded below.
>>
>>
>> On Tue, Dec 6, 2022 at 3:35 AM Phil Harman <phil.har...@gmail.com> wrote:
>>
>>> I have a number of "ZFS backup" servers (about 1PB split between four
>>> machines).
>>>
>>> Some of them have 16x 18TB drives, but a couple have a mix of 12TB and
>>> 14TB drives (because that's what we had).
>>>
>>> All are running Ubuntu 20.04 LTS (with a snapshot build) or 22.04 LTS
>>> (bundled).
>>>
>>> We're just doing our first replace (actually swapping in a 16TB drive
>>> for a 12TB because that's all we have).
>>>
>>> root@hqs1:~# zpool version
>>> zfs-2.1.99-784_gae07fc139
>>> zfs-kmod-2.1.99-784_gae07fc139
>>> root@hqs1:~# zpool status
>>>   pool: hqs1p1
>>>  state: DEGRADED
>>> status: One or more devices is currently being resilvered.  The pool will
>>> continue to function, possibly in a degraded state.
>>> action: Wait for the resilver to complete.
>>>   scan: resilver in progress since Mon Dec  5 15:19:31 2022
>>> 177T scanned at 2.49G/s, 175T issued at 2.46G/s, 177T total
>>> 7.95T resilvered, 98.97% done, 00:12:39 to go
>>> config:
>>>
>>> NAME                                      STATE     READ WRITE CKSUM
>>> hqs1p1                                    DEGRADED     0     0     0
>>>  draid2:3d:16c:1s-0                      DEGRADED     0     0     0
>>>    780530e1-d2e4-0040-aa8b-8c7bed75a14a  ONLINE       0     0     0
>>>  (resilvering)
>>>    9c4428e8-d16f-3849-97d9-22fc441750dc  ONLINE       0     0     0
>>>  (resilvering)
>>>    0e148b1d-69a3-3345-9478-343ecf6b855d  ONLINE       0     0     0
>>>  (resilvering)
>>>    98208ffe-4b31-564f-832d-5744c809f163  ONLINE       0     0     0
>>>  (resilvering)
>>>    3ac46b0a-9c46-e14f-8137-69227f3a890a  ONLINE       0     0     0
>>>  (resilvering)
>>>    44e8f62f-5d49-c345-9c89-ac82926d42b7  ONLINE       0     0     0
>>>  (resilvering)
>>>    968dbacd-1d85-0b40-a1fc-977a09ac5aaa  ONLINE       0     0     0
>>>  (resilvering)
>>>    e7ca2666-1067-f54c-b723-b464fb0a5fa3  ONLINE       0     0     0
>>>  (resilvering)
>>>    318ff075-8860-e84e-8063-f77775f57a2d  ONLINE       0     0     0
>>>  (resilvering)
>>>    replacing-9                           DEGRADED     0     0     0
>>>      2888151727045752617                 UNAVAIL      0     0     0  was
>>> /dev/disk/by-partuuid/85fa9347-8359-4942-a20d-da1f6016ea48
>>>      sdd                                 ONLINE       0     0     0
>>>  (resilvering)
>>>    fd69f284-d05d-f145-9bdb-0da8a72bf311  ONLINE       0     0     0
>>>  (resilvering)
>>>    f40f997a-33a1-2a4e-bb8d-64223c441f0f  ONLINE       0     0     0
>>>  (resilvering)
>>>    dbc35ea9-95d1-bd40-b79e-90d8a37079a6  ONLINE       0     0     0
>>>  (resilvering)
>>>    ac62bf3e-517e-a444-ae4f-a784b81cd14c  ONLINE       0     0     0
>>>  (resilvering)
>>>    d211031c-54d4-2443-853c-7e5c075b28ab  ONLINE       0     0     0
>>>  (resilvering)
>>>    06ba16e5-05cf-9b45-a267-510bfe98ceb1  ONLINE       0     0     0
>>>  (resilvering)
>>>  draid2:3d:6c:1s-1                       ONLINE       0     0     0
>>>    be297802-095c-7d43-9132-360627ba8ceb  ONLINE       0     0     0
>>>    e849981c-7316-cb47-b926-61d444790518  ONLINE       0     0     0
>>>    bbc6d66d-38e1-c448-9d00-10ba7adcd371  ONLINE       0     0     0
>>>    9fb44c95-5ea6-2347-ae97-38de283f45bf  ONLINE       0     0     0
>>>    b212cae5-5068-8740-b120-0618ad459c1f  ONLINE       0     0     0
>>>    8c771f6b-7d48-e744-9e25-847230fd2fdd  ONLINE       0     0     0
>>> spares
>>>  draid2-0-0                              AVAIL
>>>  draid2-1-0                              AVAIL
>>>
>>> errors: No known data errors
>>> root@hqs1:~#
>>>
>>> Here's a snippet of zpool iostat -v 1 ...
>>>
>>>                                             capacity     operations
>>> bandwidth
>>> pool                                      alloc   free   read  write
>>> read  write
>>> ----------------------------------------  -----  -----  -----  -----
>>>  -----  -----
>>> hqs1p1                                     178T  67.1T  5.68K    259
>>> 585M   144M
>>>   draid2:3d:16c:1s-0                       128T  63.1T  5.66K    259
>>> 585M   144M
>>>     780530e1-d2e4-0040-aa8b-8c7bed75a14a      -      -    208      0
>>>  25.4M      0
>>>     9c4428e8-d16f-3849-97d9-22fc441750dc      -      -   1010      2
>>> 102M  23.7K
>>>     0e148b1d-69a3-3345-9478-343ecf6b855d      -      -    145      0
>>>  34.1M  15.8K
>>>     98208ffe-4b31-564f-832d-5744c809f163      -      -    101      1
>>>  28.3M  7.90K
>>>     3ac46b0a-9c46-e14f-8137-69227f3a890a      -      -    511      0
>>>  53.5M      0
>>>     44e8f62f-5d49-c345-9c89-ac82926d42b7      -      -     12      0
>>>  4.82M      0
>>>     968dbacd-1d85-0b40-a1fc-977a09ac5aaa      -      -     22      0
>>>  5.43M  15.8K
>>>     e7ca2666-1067-f54c-b723-b464fb0a5fa3      -      -    227      2
>>>  36.7M  23.7K
>>>     318ff075-8860-e84e-8063-f77775f57a2d      -      -    999      1
>>>  83.1M  7.90K
>>>     replacing-9                               -      -      0    243
>>>  0   144M
>>>       2888151727045752617                     -      -      0      0
>>>  0      0
>>>       sdd                                     -      -      0    243
>>>  0   144M
>>>     fd69f284-d05d-f145-9bdb-0da8a72bf311      -      -    306      0
>>>  54.5M  15.8K
>>>     f40f997a-33a1-2a4e-bb8d-64223c441f0f      -      -     47      0
>>>  16.9M      0
>>>     dbc35ea9-95d1-bd40-b79e-90d8a37079a6      -      -    234      0
>>>  16.7M  15.8K
>>>     ac62bf3e-517e-a444-ae4f-a784b81cd14c      -      -    417      0
>>>  15.9M      0
>>>     d211031c-54d4-2443-853c-7e5c075b28ab      -      -    911      0
>>>  48.4M  15.8K
>>>     06ba16e5-05cf-9b45-a267-510bfe98ceb1      -      -    643      0
>>>  60.1M  15.8K
>>>   draid2:3d:6c:1s-1                       50.6T  3.96T     19      0
>>> 198K      0
>>>     be297802-095c-7d43-9132-360627ba8ceb      -      -      3      0
>>>  39.5K      0
>>>     e849981c-7316-cb47-b926-61d444790518      -      -      2      0
>>>  23.7K      0
>>>     bbc6d66d-38e1-c448-9d00-10ba7adcd371      -      -      2      0
>>>  23.7K      0
>>>     9fb44c95-5ea6-2347-ae97-38de283f45bf      -      -      3      0
>>>  39.5K      0
>>>     b212cae5-5068-8740-b120-0618ad459c1f      -      -      3      0
>>>  39.5K      0
>>>     8c771f6b-7d48-e744-9e25-847230fd2fdd      -      -      1      0
>>>  31.6K      0
>>> ----------------------------------------  -----  -----  -----  -----
>>>  -----  -----
>>>
>>> Lots of DRAID goodness there. Seems to be resilvering at a good whack.
>>>
>>> My main question is: why have the DRAID vdevs been so disproportionately
>>> allocated?
>>>
>>
>> When doing a replace to a physical disk, which is not the same as a
>> replace to a logical spare, the disk's write performance is the
>> bottleneck. The draid rebuild-time improvements only apply when
>> rebuilding onto a logical spare.
>>
>> The data should be spread about nicely, but I don't think you'll see that
>> with a short time interval.
>> Over perhaps 100 seconds or so, it should look more randomly spread.
>>  -- richard
>>
>>
>>
>>>

------------------------------------------
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T386dac0e170785f1-Me4ad6a5093ae3412cc99d80b