On Wed, Oct 15, 2014 at 11:28 AM, Steven Hartland <[email protected]> wrote:
> ----- Original Message ----- From: "Steven Hartland via illumos-zfs"
> <[email protected]>
>
> snip
>
>> Thanks Matt, I'd be interested to see if the stream file above changes
>> that.
>>
>>> I took a look at the pool image you linked. The "leaked" space is
>>> consumed by deferred frees (zdb -bb will tell you this). My first
>>> guess would be that something is causing us to not process the
>>> deferred free list when we should, i.e. spa_sync() is not calling
>>> spa_sync_deferred_frees(), but it is subsequently syncing real
>>> changes to disk and adding more deferred frees. This code looks
>>> somewhat fragile -- we are essentially checking "will there be
>>> changes this txg", but the answer is not enforced. There were some
>>> changes to dsl_scan_active() as part of the commit you mentioned,
>>> which could be involved.
>>
>> The limited tests I did last night after finding the problem revision
>> seem to agree with that, as dtrace was showing that without this
>> revision dsl_scan_sync was always processing the same txg, but with it
>> txg was continually incrementing, e.g.
>>
>> snip
>
> OK, so looking through this code, the one thing that doesn't make sense
> to me is the assignment of scn->scn_async_stalled in dsl_scan_sync.
>
> I believe scn_async_stalled should only be set if we made no progress;
> however, in its current form it's being set even when bptree_is_empty
> indicates everything actually completed.
>
> The reason that scn_visited_this_txg is zero in this case is that in
> traverse_visitbp we're triggering the early return case
> (bp->blk_birth <= td->td_min_txg), which means td->td_func
> (dsl_scan_free_block_cb) isn't called, hence scn_visited_this_txg is
> never incremented.
> This early return was added by:
> https://github.com/freebsd/freebsd/commit/ede49d4106ad9e6d34a2592322e48604f72605bd
>
> In this case we appear to be destroying the temporary clones created by
> the receive, as shown by the pool internal history:
>
> 2014-10-15.15:09:17 [txg:9] receive zfs/test/ROOT (48)
> 2014-10-15.15:09:17 [txg:10] set zfs/test/ROOT (48) $hasrecvd=
> 2014-10-15.15:09:18 [txg:31] finish receiving zfs/test/ROOT (48) snap=auto-2014-08-16_04.00
> 2014-10-15.15:09:18 [txg:31] snapshot zfs/test/ROOT@auto-2014-08-16_04.00 (94)
> 2014-10-15.15:09:18 [txg:32] receive zfs/test/ROOT/%recv (100)
> 2014-10-15.15:09:18 [txg:33] finish receiving zfs/test/ROOT/%recv (100) snap=auto-2014-08-17_04.00
> 2014-10-15.15:09:18 [txg:33] clone swap zfs/test/ROOT/%recv (100) parent=ROOT
> 2014-10-15.15:09:18 [txg:33] snapshot zfs/test/ROOT@auto-2014-08-17_04.00 (106)
> 2014-10-15.15:09:18 [txg:33] destroy zfs/test/ROOT/%recv (100)
> 2014-10-15.15:09:18 [txg:34] receive zfs/test/ROOT/%recv (112)
> 2014-10-15.15:09:18 [txg:35] finish receiving zfs/test/ROOT/%recv (112) snap=auto-2014-08-19_04.00
> 2014-10-15.15:09:18 [txg:35] clone swap zfs/test/ROOT/%recv (112) parent=ROOT
> 2014-10-15.15:09:18 [txg:35] snapshot zfs/test/ROOT@auto-2014-08-19_04.00 (117)
> 2014-10-15.15:09:18 [txg:35] destroy zfs/test/ROOT/%recv (112)
>
> It seems like the actual intention of this code was the following,
> which also seems to prevent the leak:
>
> svn diff -x -p
> Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c
> ===================================================================
> --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c	(revision 273115)
> +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c	(working copy)
> @@ -1476,12 +1476,6 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
>  			    "traverse_dataset_destroyed()", err);
>  		}
>
> -		/*
> -		 * If we didn't make progress, mark the async destroy as
> -		 * stalled, so that we will not initiate a spa_sync() on
> -		 * its behalf.
> -		 */
> -		scn->scn_async_stalled = (scn->scn_visited_this_txg == 0);
>
>  		if (bptree_is_empty(dp->dp_meta_objset,
>  		    dp->dp_bptree_obj)) {
>  			/* finished; deactivate async destroy feature */
> @@ -1495,6 +1489,13 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
>  			    dp->dp_bptree_obj, tx));
>  			dp->dp_bptree_obj = 0;
>  			scn->scn_async_destroying = B_FALSE;
> +		} else {
> +			/*
> +			 * If we didn't make progress, mark the async
> +			 * destroy as stalled, so that we will not
> +			 * initiate a spa_sync() on its behalf.
> +			 */
> +			scn->scn_async_stalled =
> +			    (scn->scn_visited_this_txg == 0);
>  		}
>  	}
>  	if (scn->scn_visited_this_txg) {
>
> So my questions are:
>
> 1. Is this the actual intention of the code?

Yes. If there is an entry in dp_bptree_obj (and therefore
SPA_FEATURE_ASYNC_DESTROY is active), but the tree is effectively empty
(there's nothing to traverse in it), then we could incorrectly set
scn_async_stalled, leading to this behavior: spa_sync() will see that
dsl_scan_active() is FALSE and thus conclude that no changes will be
synced this txg, so we don't spa_sync_deferred_frees(). However,
dsl_scan_sync() will see that scn_async_stalled is set, and therefore
try again. Though we don't actually try to process the bptree again
(because SPA_FEATURE_ASYNC_DESTROY is no longer active), we do
bpobj_iterate() to process potential background snapshot destroys. This
always dirties the bpobj's bonus buffer, even if nothing actually
changed. The dirty buffer then needs to be written out.

I was able to reproduce the bug on illumos using your send stream.
However, touching a file in the pool caused spa_sync_deferred_frees() to
be called, thus resetting the free space back to where it should be (but
the free space continued decreasing again after that).
In addition to the patch you have above, I think we need to do some work
to make the logic around deferred frees more robust, to prevent similar
bugs from creeping in in the future.

> 2. Is there a way to clean up the leaked deferred frees on an existing
> pool?

If it hasn't totally run out of space, making any change in the pool
(e.g. creating a new file) will clean up the deferred frees, but the
free space will start decreasing again. Rebooting (or exporting and
importing the pool) will clear scn_async_stalled, so the problem should
go away entirely.

> 3. Is there a way to import a pool "writable" which has so many frees
> it's now full and failing to complete the import with ENOSPC?

Not that I know of, unfortunately.

> 4. Should these deferred frees have ever multiplied like this, or
> should we be preventing the multiple storage of deferred frees?

I'm not sure what you mean. There can normally be more than one block
pointer in the deferred free list. Are you suggesting that the same
blkptr_t has been added to the deferred free list multiple times?

--matt
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
