On Sep 3, 2023, at 22:06, Alexander Motin <[email protected]> wrote:
>
> On 03.09.2023 22:54, Mark Millard wrote:
>> After that ^t produced the likes of:
>> load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 13004k
>
> So the full state is not "tx->tx", but is actually
> "tx->tx_quiesce_done_cv", which means the thread is waiting for a new
> transaction to be opened, which in turn requires some previous one to be
> quiesced and then synced.
>
>> #0 0xffffffff80b6f103 at mi_switch+0x173
>> #1 0xffffffff80bc0f24 at sleepq_switch+0x104
>> #2 0xffffffff80aec4c5 at _cv_wait+0x165
>> #3 0xffffffff82aba365 at txg_wait_open+0xf5
>> #4 0xffffffff82a11b81 at dmu_free_long_range+0x151
>
> Here it seems the transaction commit is being waited on due to a large
> number of delete operations, which ZFS tries to spread across separate TXGs.
That fits the context: cleaning out /usr/local/poudriere/data/.m/
> You should probably see some large and growing number in sysctl
> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay .
After the reboot I started a -J64 example. It avoided the early
"witness exhausted". Again I ^C'd, this time about an hour after
the 2nd builder had started, so it is again cleaning out
/usr/local/poudriere/data/.m/ . Only seconds elapsed between:
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027
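So across those four samples the counter grew from 276042 to 278027,
i.e. by 278027 - 276042 = 1985 in just those few seconds: large and
still growing, as predicted.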
I have found a measure of progress: zfs list's USED
for /usr/local/poudriere/data/.m is decreasing. So
ztop's d/s was a good classification: deletes.
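For anyone wanting to watch that progress, a minimal sketch (zfs list
accepts a path inside a mounted dataset; the path is the one from this
report):

# while true; do zfs list -o name,used /usr/local/poudriere/data/.m; sleep 60; done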
>> #5 0xffffffff829a87d2 at zfs_rmnode+0x72
>> #6 0xffffffff829b658d at zfs_freebsd_reclaim+0x3d
>> #7 0xffffffff8113a495 at VOP_RECLAIM_APV+0x35
>> #8 0xffffffff80c5a7d9 at vgonel+0x3a9
>> #9 0xffffffff80c5af7f at vrecycle+0x3f
>> #10 0xffffffff829b643e at zfs_freebsd_inactive+0x4e
>> #11 0xffffffff80c598cf at vinactivef+0xbf
>> #12 0xffffffff80c590da at vput_final+0x2aa
>> #13 0xffffffff80c68886 at kern_funlinkat+0x2f6
>> #14 0xffffffff80c68588 at sys_unlink+0x28
>> #15 0xffffffff8106323f at amd64_syscall+0x14f
>> #16 0xffffffff8103512b at fast_syscall_common+0xf8
>
> What we don't see here is what the quiesce and sync threads of the pool
> are actually doing. The sync thread has plenty of different jobs, including
> async write, async destroy, scrub and others, which can all delay each other.
>
> Before you reboot the system, depending on how alive it is, could you save
> a number of outputs of `procstat -akk`, or at least `procstat -akk
> | grep txg_thread_enter` if the full output is too much? Or somehow else
> observe what those threads are doing.
# procstat -akk > ~/mmjnk00.txt
# procstat -akk > ~/mmjnk01.txt
# procstat -akk > ~/mmjnk02.txt
# procstat -akk > ~/mmjnk03.txt
# procstat -akk > ~/mmjnk04.txt
# procstat -akk > ~/mmjnk05.txt
# grep txg_thread_enter ~/mmjnk0[0-5].txt
/usr/home/root/mmjnk00.txt: 6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk00.txt: 6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk01.txt: 6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk01.txt: 6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk02.txt: 6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk02.txt: 6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk03.txt: 6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk03.txt: 6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk04.txt: 6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk04.txt: 6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk05.txt: 6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk05.txt: 6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
In each sample the quiesce thread (tid 100881) is idle in
txg_thread_wait, while the sync thread (tid 100882) is blocked in
zio_wait under dsl_pool_sync, i.e. waiting on I/O.
(Hopefully that will be a sufficiently useful start.)
> `zpool status`, `zpool get all` and `sysctl -a` would also not hurt.
It is a very simple zpool configuration: one partition. I only
use zfs for bectl boot environment (BE) purposes, not the general
range of reasons for using zfs. I created the media with my
normal content, then checkpointed before doing the git fetch that
started setting up the experiment.
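For reference, the checkpoint handling is along these lines (a sketch,
not the exact command history; zpool-checkpoint(8) covers the details,
and rewinding requires an export/import cycle):

# zpool checkpoint zamd64
... run the experiment ...
# zpool export zamd64
# zpool import --rewind-to-checkpoint zamd64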
# zpool status
  pool: zamd64
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
checkpoint: created Sun Sep  3 11:46:54 2023, consumes 2.17M
config:

        NAME            STATE     READ WRITE CKSUM
        zamd64          ONLINE       0     0     0
          gpt/amd64zfs  ONLINE       0     0     0

errors: No known data errors
There was also a snapshot in place before I did the
checkpoint operation.
I deliberately did not use my typical openzfs-2.1-freebsd
compatibility setting but used the defaults when creating
the pool:
# zpool get all
NAME PROPERTY VALUE SOURCE
zamd64 size 872G -
zamd64 capacity 21% -
zamd64 altroot - default
zamd64 health ONLINE -
zamd64 guid 4817074778276814820 -
zamd64 version - default
zamd64 bootfs zamd64/ROOT/main-amd64 local
zamd64 delegation on default
zamd64 autoreplace off default
zamd64 cachefile - default
zamd64 failmode wait default
zamd64 listsnapshots off default
zamd64 autoexpand off default
zamd64 dedupratio 1.00x -
zamd64 free 688G -
zamd64 allocated 184G -
zamd64 readonly off -
zamd64 ashift 0 default
zamd64 comment - default
zamd64 expandsize - -
zamd64 freeing 0 -
zamd64 fragmentation 17% -
zamd64 leaked 0 -
zamd64 multihost off default
zamd64 checkpoint 2.17M -
zamd64 load_guid 17719601284614388220 -
zamd64 autotrim off default
zamd64 compatibility off default
zamd64 bcloneused 0 -
zamd64 bclonesaved 0 -
zamd64 bcloneratio 1.00x -
zamd64 feature@async_destroy enabled local
zamd64 feature@empty_bpobj active local
zamd64 feature@lz4_compress active local
zamd64 feature@multi_vdev_crash_dump enabled local
zamd64 feature@spacemap_histogram active local
zamd64 feature@enabled_txg active local
zamd64 feature@hole_birth active local
zamd64 feature@extensible_dataset active local
zamd64 feature@embedded_data active local
zamd64 feature@bookmarks enabled local
zamd64 feature@filesystem_limits enabled local
zamd64 feature@large_blocks enabled local
zamd64 feature@large_dnode enabled local
zamd64 feature@sha512 enabled local
zamd64 feature@skein enabled local
zamd64 feature@edonr enabled local
zamd64 feature@userobj_accounting active local
zamd64 feature@encryption enabled local
zamd64 feature@project_quota active local
zamd64 feature@device_removal enabled local
zamd64 feature@obsolete_counts enabled local
zamd64 feature@zpool_checkpoint active local
zamd64 feature@spacemap_v2 active local
zamd64 feature@allocation_classes enabled local
zamd64 feature@resilver_defer enabled local
zamd64 feature@bookmark_v2 enabled local
zamd64 feature@redaction_bookmarks enabled local
zamd64 feature@redacted_datasets enabled local
zamd64 feature@bookmark_written enabled local
zamd64 feature@log_spacemap active local
zamd64 feature@livelist enabled local
zamd64 feature@device_rebuild enabled local
zamd64 feature@zstd_compress enabled local
zamd64 feature@draid enabled local
zamd64 feature@zilsaxattr active local
zamd64 feature@head_errlog active local
zamd64 feature@blake3 enabled local
zamd64 feature@block_cloning enabled local
zamd64 feature@vdev_zaps_v2 active local
zamd64 feature@redaction_list_spill disabled local
/etc/sysctl.conf does have:
vfs.zfs.min_auto_ashift=12
vfs.zfs.per_txg_dirty_frees_percent=5
The vfs.zfs.per_txg_dirty_frees_percent setting is from prior
help from Mateusz Guzik, where after testing the change I
reported:

Result summary: Seems to have avoided the sustained periods
of low load average activity. Much better for the context.

That was for a different machine (aarch64, 8 cores), but it
too was for poudriere bulk use. It turned out the default of
30 was causing something like what is seen here; I could have
presented some of that information via the small load average
figures as well. (Note: 5 is the old default, 30 is the newer
one. Other contexts have other problems with 5: there is no
single right setting and no automated configuration.)
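For anyone reproducing this, the setting can also be inspected and
adjusted at runtime rather than only via /etc/sysctl.conf (a minimal
sketch; the value 5 matches the entry above):

# sysctl vfs.zfs.per_txg_dirty_frees_percent
# sysctl vfs.zfs.per_txg_dirty_frees_percent=5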
Other than those two items, zfs is untuned (defaults).
sysctl -a produces a lot more output (864930 bytes), so I'll
skip it for now.
> PS: I may be wrong, but the USB in "USB3 NVMe SSD storage" makes me shiver.
> Make sure there are no storage problems, like huge delays, timeouts, etc.,
> that can be seen, for example, as busy percentages regularly spiking far
> above 100% in your `gstat -spod`.
>
The "gstat -spod" output showed (and shows): around 0.8ms/w to 3ms/w,
mostly at the lower end of the range. < 98%busy, no spikes to > 100%.
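(For reference, per gstat(8): -s adds block size statistics, -p limits
the display to physical providers, -o adds statistics for other
operations such as BIO_FLUSH, and -d adds statistics for BIO_DELETE
operations, i.e. the deletes discussed above.)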
It is a previously unused Samsung PSSD T7 Touch.
I was not prepared to replace the content of a PCIe slot's
media or an M.2 connection's media for this temporary purpose:
I have no spares for those, so no simple swapping was possible.
===
Mark Millard
marklmi at yahoo.com