On Wed, Oct 15, 2014 at 11:28 AM, Steven Hartland <[email protected]>
wrote:

> ----- Original Message ----- From: "Steven Hartland via illumos-zfs" <
> [email protected]>
> snip
>
>> Thanks Matt, I'd be interested to see if the stream file above changes that.
>>
>>  I took a look at the pool image you linked.  The "leaked" space is
>>> consumed
>>> by deferred frees (zdb -bb will tell you this).  My first guess would be
>>> that something is causing us to not process the deferred free list when
>>> we
>>> should.  i.e. spa_sync() is not calling spa_sync_deferred_frees(), but it
>>> is subsequently syncing real changes to disk and adding more deferred
>>> frees.  This code looks somewhat fragile -- we are essentially checking
>>> "will there be changes this txg", but the answer is not enforced.  There
>>> were some changes to dsl_scan_active() as part of the commit you
>>> mentioned,
>>> which could be involved.
>>>
>>
>> The limited tests I did last night after finding the problem revision
>> seem to agree with that: dtrace showed that without this revision
>> dsl_scan_sync was always processing the same txg, but with it the txg
>> was continually incrementing, e.g.
>>
>>  snip
>
> Ok, so looking through this code, the one thing that doesn't make sense
> to me is the assignment of scn->scn_async_stalled in dsl_scan_sync.
>
> I believe scn_async_stalled should only be set if we made no progress;
> however, in its current form it's being set even when bptree_is_empty
> indicates everything actually completed.


> The reason that scn_visited_this_txg is zero in this case is that in
> traverse_visitbp we're triggering the early return case
> (bp->blk_birth <= td->td_min_txg), which means td->td_func
> (dsl_scan_free_block_cb) isn't called, hence scn_visited_this_txg is
> never incremented.


> This early return was added by:
> https://github.com/freebsd/freebsd/commit/ede49d4106ad9e6d34a2592322e48604f72605bd
>
> In this case we appear to be destroying the temporary clones created by
> the receive, as shown by the pool internal history:
>
> 2014-10-15.15:09:17 [txg:9] receive zfs/test/ROOT (48)
> 2014-10-15.15:09:17 [txg:10] set zfs/test/ROOT (48) $hasrecvd=
> 2014-10-15.15:09:18 [txg:31] finish receiving zfs/test/ROOT (48) snap=auto-2014-08-16_04.00
> 2014-10-15.15:09:18 [txg:31] snapshot zfs/test/ROOT@auto-2014-08-16_04.00 (94)
> 2014-10-15.15:09:18 [txg:32] receive zfs/test/ROOT/%recv (100)
> 2014-10-15.15:09:18 [txg:33] finish receiving zfs/test/ROOT/%recv (100) snap=auto-2014-08-17_04.00
> 2014-10-15.15:09:18 [txg:33] clone swap zfs/test/ROOT/%recv (100) parent=ROOT
> 2014-10-15.15:09:18 [txg:33] snapshot zfs/test/ROOT@auto-2014-08-17_04.00 (106)
> 2014-10-15.15:09:18 [txg:33] destroy zfs/test/ROOT/%recv (100)
> 2014-10-15.15:09:18 [txg:34] receive zfs/test/ROOT/%recv (112)
> 2014-10-15.15:09:18 [txg:35] finish receiving zfs/test/ROOT/%recv (112) snap=auto-2014-08-19_04.00
> 2014-10-15.15:09:18 [txg:35] clone swap zfs/test/ROOT/%recv (112) parent=ROOT
> 2014-10-15.15:09:18 [txg:35] snapshot zfs/test/ROOT@auto-2014-08-19_04.00 (117)
> 2014-10-15.15:09:18 [txg:35] destroy zfs/test/ROOT/%recv (112)
>
> It seems like the actual intention of this code was the following, which
> also seems to prevent the leak:
>
> svn diff -x -p
> Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c
> ===================================================================
> --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c   (revision 273115)
> +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c   (working copy)
> @@ -1476,12 +1476,6 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
>                             "traverse_dataset_destroyed()", err);
>                 }
>
> -               /*
> -                * If we didn't make progress, mark the async destroy as
> -                * stalled, so that we will not initiate a spa_sync() on
> -                * its behalf.
> -                */
> -               scn->scn_async_stalled = (scn->scn_visited_this_txg == 0);
>
>                 if (bptree_is_empty(dp->dp_meta_objset, dp->dp_bptree_obj)) {
>                         /* finished; deactivate async destroy feature */
> @@ -1495,6 +1489,13 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
>                             dp->dp_bptree_obj, tx));
>                         dp->dp_bptree_obj = 0;
>                         scn->scn_async_destroying = B_FALSE;
> +               } else {
> +                       /*
> +                        * If we didn't make progress, mark the async
> +                        * destroy as stalled, so that we will not
> +                        * initiate a spa_sync() on its behalf.
> +                        */
> +                       scn->scn_async_stalled = (scn->scn_visited_this_txg == 0);
>                 }
>         }
>         if (scn->scn_visited_this_txg) {
>        if (scn->scn_visited_this_txg) {
>
> So my questions are:
> 1. Is this the actual intention of the code?
>

Yes.  If there is an entry in the dp_bptree_obj (and therefore
SPA_FEATURE_ASYNC_DESTROY is active), but the tree is effectively empty
(there's nothing to traverse in it), then we could incorrectly set
scn_async_stalled, leading to this behavior.

spa_sync() will see that dsl_scan_active() is FALSE, and thus conclude that
no changes will be synced this txg, so we don't spa_sync_deferred_frees().
However, dsl_scan_sync() will see that scn_async_stalled is set, and
therefore try again.  Though we don't actually try to process the bptree
again, because SPA_FEATURE_ASYNC_DESTROY is no longer active, we do
bpobj_iterate() to process potential background snapshot destroys.  This
always dirties the bpobj's bonus buffer, even if nothing is actually
changed.  The dirty buffer then needs to be written out.

I was able to reproduce the bug on illumos using your send stream.
However, touching a file in the pool caused spa_sync_deferred_frees() to be
called, thus resetting the free space back to where it should be (though
the free space continued decreasing again after that).

In addition to the patch you have above, I think we need to do some work to
make the logic around deferred frees more robust, to prevent similar bugs
from creeping in in the future.

> 2. Is there a way to clean up the leaked deferred frees on an existing
>    pool?

If it hasn't totally run out of space, making any change in the pool
(e.g. creating a new file) will clean up the deferred frees, but the free
space will start decreasing again.  Rebooting (or exporting/importing the
pool) will clear scn_async_stalled, so the problem should go away
entirely.


> 3. Is there a way to import a pool "writable" which has so many deferred
>    frees it's now full and failing to complete the import with ENOSPC?
>

Not that I know of, unfortunately.

> 4. Should these deferred frees have ever multiplied like this, or should
>    we be preventing the multiple storage of deferred frees?
>
>
I'm not sure what you mean.  There can normally be more than one block
pointer in the deferred free list.  Are you suggesting that the same
blkptr_t has been added to the deferred free list multiple times?

--matt
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
