On Sep 3, 2023, at 22:06, Alexander Motin <[email protected]> wrote:
>
> On 03.09.2023 22:54, Mark Millard wrote:
>> After that ^t produced the likes of:
>> load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 13004k
>
> So the full state is not "tx->tx", but is actually
> "tx->tx_quiesce_done_cv", which means the thread is waiting for a new
> transaction to be opened, which in turn requires some previous one to be
> quiesced and then synced.
>
>> #0 0xffffffff80b6f103 at mi_switch+0x173
>> #1 0xffffffff80bc0f24 at sleepq_switch+0x104
>> #2 0xffffffff80aec4c5 at _cv_wait+0x165
>> #3 0xffffffff82aba365 at txg_wait_open+0xf5
>> #4 0xffffffff82a11b81 at dmu_free_long_range+0x151
>
> Here it seems the transaction commit is being waited on due to a large
> number of delete operations, which ZFS tries to spread across separate TXGs.
That fits the context: cleaning out /usr/local/poudriere/data/.m/
> You should probably see some large and growing number in sysctl
> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay .
After the reboot I started a -J64 example. It avoided the early
"witness exhausted". Again I ^C'd, this time about an hour after
the 2nd builder had started, so it is again cleaning out
/usr/local/poudriere/data/.m/ . Only seconds elapsed between:
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027
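So across those four samples the counter grew from 276042 to 278027,
i.e. by 278027 - 276042 = 1985 in just those few seconds: large and
still growing, as predicted.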
I have found a measure of progress: zfs list's USED
for /usr/local/poudriere/data/.m is decreasing. So
ztop's d/s was a good classification: deletes.
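For anyone wanting to watch that progress, a minimal sketch (zfs list
accepts a path inside a mounted dataset; the path is the one from this
report):

# while true; do zfs list -o name,used /usr/local/poudriere/data/.m; sleep 60; done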
>> #5 0xffffffff829a87d2 at zfs_rmnode+0x72
>> #6 0xffffffff829b658d at zfs_freebsd_reclaim+0x3d
>> #7 0xffffffff8113a495 at VOP_RECLAIM_APV+0x35
>> #8 0xffffffff80c5a7d9 at vgonel+0x3a9
>> #9 0xffffffff80c5af7f at vrecycle+0x3f
>> #10 0xffffffff829b643e at zfs_freebsd_inactive+0x4e
>> #11 0xffffffff80c598cf at vinactivef+0xbf
>> #12 0xffffffff80c590da at vput_final+0x2aa
>> #13 0xffffffff80c68886 at kern_funlinkat+0x2f6
>> #14 0xffffffff80c68588 at sys_unlink+0x28
>> #15 0xffffffff8106323f at amd64_syscall+0x14f
>> #16 0xffffffff8103512b at fast_syscall_common+0xf8
>
> What we don't see here is what the quiesce and sync threads of the pool
> are actually doing. The sync thread has plenty of different jobs, including
> async write, async destroy, scrub and others, which can all delay each other.
>
> Before you reboot the system, depending on how alive it is, could you save
> a number of outputs of `procstat -akk`, or at least `procstat -akk
> | grep txg_thread_enter` if the full output is too much? Or somehow else
> observe what those threads are doing.
# procstat -akk > ~/mmjnk00.txt
# procstat -akk > ~/mmjnk01.txt
# procstat -akk > ~/mmjnk02.txt
# procstat -akk > ~/mmjnk03.txt
# procstat -akk > ~/mmjnk04.txt
# procstat -akk > ~/mmjnk05.txt
# grep txg_thread_enter ~/mmjnk0[0-5].txt
/usr/home/root/mmjnk00.txt: 6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk00.txt: 6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk01.txt: 6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk01.txt: 6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk02.txt: 6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk02.txt: 6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk03.txt: 6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk03.txt: 6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk04.txt: 6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk04.txt: 6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk05.txt: 6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk05.txt: 6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
In each sample the quiesce thread (tid 100881) is idle in
txg_thread_wait, while the sync thread (tid 100882) is blocked in
zio_wait under dsl_pool_sync, i.e. waiting on I/O.
(Hopefully that will be a sufficiently useful start.)
> `zpool status`, `zpool get all` and `sysctl -a` would also not hurt.
It is a very simple zpool configuration: one partition. I only
use zfs for bectl boot environment (BE) purposes, not the general
range of reasons for using zfs. I created the media with my
normal content, then checkpointed before doing the git fetch that
started setting up the experiment.
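For reference, the checkpoint handling is along these lines (a sketch,
not the exact command history; zpool-checkpoint(8) covers the details,
and rewinding requires an export/import cycle):

# zpool checkpoint zamd64
... run the experiment ...
# zpool export zamd64
# zpool import --rewind-to-checkpoint zamd64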
# zpool status
  pool: zamd64
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
checkpoint: created Sun Sep  3 11:46:54 2023, consumes 2.17M
config:

        NAME            STATE     READ WRITE CKSUM
        zamd64          ONLINE       0     0     0
          gpt/amd64zfs  ONLINE       0     0     0

errors: No known data errors
There was also a snapshot in place before I did the
checkpoint operation.
I deliberately did not use my typical openzfs-2.1-freebsd
compatibility setting but used the defaults when creating
the pool:
# zpool get all
NAME PROPERTY VALUE SOURCE
zamd64 size 872G -
zamd64 capacity 21% -
zamd64 altroot - default
zamd64 health ONLINE -
zamd64 guid 4817074778276814820 -
zamd64 version - default
zamd64 bootfs zamd64/ROOT/main-amd64 local
zamd64 delegation on default
zamd64 autoreplace off default
zamd64 cachefile - default
zamd64 failmode wait default
zamd64 listsnapshots off default
zamd64 autoexpand off default
zamd64 dedupratio 1.00x -
zamd64 free 688G -
zamd64 allocated 184G -
zamd64 readonly off -
zamd64 ashift 0 default
zamd64 comment - default
zamd64 expandsize - -
zamd64 freeing 0 -
zamd64 fragmentation 17% -
zamd64 leaked 0 -
zamd64 multihost off default
zamd64 checkpoint 2.17M -
zamd64 load_guid 17719601284614388220 -
zamd64 autotrim off default
zamd64 compatibility off default
zamd64 bcloneused 0 -
zamd64 bclonesaved 0 -
zamd64 bcloneratio 1.00x -
zamd64 feature@async_destroy enabled local
zamd64 feature@empty_bpobj active local
zamd64 feature@lz4_compress active local
zamd64 feature@multi_vdev_crash_dump enabled local
zamd64 feature@spacemap_histogram active local
zamd64 feature@enabled_txg active local
zamd64 feature@hole_birth active local
zamd64 feature@extensible_dataset active local
zamd64 feature@embedded_data active local
zamd64 feature@bookmarks enabled local
zamd64 feature@filesystem_limits enabled local
zamd64 feature@large_blocks enabled local
zamd64 feature@large_dnode enabled local
zamd64 feature@sha512 enabled local
zamd64 feature@skein enabled local
zamd64 feature@edonr enabled local
zamd64 feature@userobj_accounting active local
zamd64 feature@encryption enabled local
zamd64 feature@project_quota active local
zamd64 feature@device_removal enabled local
zamd64 feature@obsolete_counts enabled local
zamd64 feature@zpool_checkpoint active local
zamd64 feature@spacemap_v2 active local
zamd64 feature@allocation_classes enabled local
zamd64 feature@resilver_defer enabled local
zamd64 feature@bookmark_v2 enabled local
zamd64 feature@redaction_bookmarks enabled local
zamd64 feature@redacted_datasets enabled local
zamd64 feature@bookmark_written enabled local
zamd64 feature@log_spacemap active local
zamd64 feature@livelist enabled local
zamd64 feature@device_rebuild enabled local
zamd64 feature@zstd_compress enabled local
zamd64 feature@draid enabled local
zamd64 feature@zilsaxattr active local
zamd64 feature@head_errlog active local
zamd64 feature@blake3 enabled local
zamd64 feature@block_cloning enabled local
zamd64 feature@vdev_zaps_v2 active local
zamd64 feature@redaction_list_spill disabled local
/etc/sysctl.conf does have:
vfs.zfs.min_auto_ashift=12
vfs.zfs.per_txg_dirty_frees_percent=5
The vfs.zfs.per_txg_dirty_frees_percent setting is from prior
help from Mateusz Guzik, where after testing the change I
reported:

Result summary: Seems to have avoided the sustained periods
of low load average activity. Much better for the context.

That was for a different machine (aarch64, 8 cores), but it
too was for poudriere bulk use. It turned out the default of
30 was causing something like what is seen here; I could have
presented some of that information via the small load average
figures as well. (Note: 5 is the old default, 30 is the newer
one. Other contexts have other problems with 5: there is no
single right setting and no automated configuration.)
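For anyone reproducing this, the setting can also be inspected and
adjusted at runtime rather than only via /etc/sysctl.conf (a minimal
sketch; the value 5 matches the entry above):

# sysctl vfs.zfs.per_txg_dirty_frees_percent
# sysctl vfs.zfs.per_txg_dirty_frees_percent=5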
Other than those two items, zfs is untuned (defaults).
sysctl -a produces a lot more output (864930 bytes), so I'll
skip it for now.
> PS: I may be wrong, but the USB in "USB3 NVMe SSD storage" makes me shiver.
> Make sure there are no storage problems, like huge delays, timeouts, etc.,
> that can be seen, for example, as busy percentages regularly spiking far
> above 100% in your `gstat -spod`.
>
The "gstat -spod" output showed (and shows): around 0.8ms/w to 3ms/w,
mostly at the lower end of the range. < 98%busy, no spikes to > 100%.
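(For reference, per gstat(8): -s adds block size statistics, -p limits
the display to physical providers, -o adds statistics for other
operations such as BIO_FLUSH, and -d adds statistics for BIO_DELETE
operations, i.e. the deletes discussed above.)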
It is a previously unused Samsung PSSD T7 Touch.
I was not prepared to replace the content of a PCIe slot's
media or an M.2 connection's media for this temporary purpose:
I have no spares for those, so no simple swapping was possible.
===
Mark Millard
marklmi at yahoo.com