ahrens commented on this pull request.
> @@ -2162,8 +2184,23 @@ metaslab_condense(metaslab_t *msp, uint64_t txg,
> dmu_tx_t *tx)
range_tree_vacate(condense_tree, NULL, NULL);
range_tree_destroy(condense_tree);
- space_map_write(sm, msp->ms_tree, SM_FREE, tx);
+ if (msp->ms_trimming_ts == NULL) {
+ space_map_write(sm, msp->ms_tree, SM_FREE, tx);
+ } else {
+ /*
+ * While trimming, the stuff being trimmed isn't in ms_tree,
+ * but we still want our persistent state to reflect that. So
+ * we construct a temporary union of the two trees.
+ */
+ range_tree_t *rt = range_tree_create(NULL, NULL, &msp->ms_lock);
+ range_tree_walk(msp->ms_tree, range_tree_add, rt);
+ range_tree_walk(msp->ms_trimming_ts, range_tree_add, rt);
The performance of these range_tree_walk() calls could be pretty bad. It's
similar to the problem described in the comment above (old line 2152). How
about instead doing:
```
space_map_write(sm, msp->ms_tree, SM_FREE, tx);
space_map_write(sm, msp->ms_trimming_ts, SM_FREE, tx);
```
The resulting spacemap will be slightly less optimal, but much faster to write.
> +
+/*
+ * Notifies the trimsets in a metaslab that an extent has been freed.
+ * This adds the segment to the currently open queue of extents awaiting
+ * to be trimmed.
+ */
+static void
+metaslab_trim_add(void *arg, uint64_t offset, uint64_t size)
+{
+ metaslab_t *msp = arg;
+
+ ASSERT(MUTEX_HELD(&msp->ms_lock));
+ ASSERT(msp->ms_cur_ts != NULL);
+ range_tree_add(msp->ms_cur_ts, offset, size);
+ ASSERT(msp->ms_prev_ts == NULL ||
+ !range_tree_contains_part(msp->ms_prev_ts, offset, size));
can we also assert that it's not in the trimming_ts?
> range_tree_verify(msp->ms_tree, offset, size);
+ if (msp->ms_trimming_ts) {
+ range_tree_verify(msp->ms_trimming_ts,
+ offset, size);
shouldn't we do this (and the new code below) even if the metaslab is not
loaded?
> range_tree_verify(msp->ms_tree, offset, size);
+ if (msp->ms_trimming_ts) {
+ range_tree_verify(msp->ms_trimming_ts,
+ offset, size);
+ }
+#ifdef DEBUG
This code is already gated by `ZFS_DEBUG_ZIO_FREE`; why do we also want to skip
this on nondebug bits? I imagine that (if loaded) ms_tree is typically much
larger, and we're already walking that even on nondebug.
> }
spa_config_exit(spa, SCL_VDEV, FTAG);
}
+
+/*
+ * Trims all free space in the metaslab. Returns the root TRIM zio (that the
+ * caller should zio_wait() for) and the amount of space in the metaslab that
+ * has been scheduled for trimming in the `delta' return argument.
+ */
+zio_t *
+metaslab_trim_all(metaslab_t *msp, uint64_t *cursor, uint64_t *delta,
+ boolean_t *was_loaded)
+{
+ uint64_t cur = *cursor, trimmed_space = 0;
use one line per variable here (they're different types, and initialized
differently).
> }
spa_config_exit(spa, SCL_VDEV, FTAG);
}
+
+/*
+ * Trims all free space in the metaslab. Returns the root TRIM zio (that the
+ * caller should zio_wait() for) and the amount of space in the metaslab that
+ * has been scheduled for trimming in the `delta' return argument.
+ */
+zio_t *
+metaslab_trim_all(metaslab_t *msp, uint64_t *cursor, uint64_t *delta,
+ boolean_t *was_loaded)
+{
+ uint64_t cur = *cursor, trimmed_space = 0;
+ zio_t *trim_io = NULL;
+ range_seg_t rsearch, *rs;
wouldn't hurt to declare these on their own lines too - it helps avoid
misreading them, since they are different types.
> }
spa_config_exit(spa, SCL_VDEV, FTAG);
}
+
+/*
+ * Trims all free space in the metaslab. Returns the root TRIM zio (that the
+ * caller should zio_wait() for) and the amount of space in the metaslab that
+ * has been scheduled for trimming in the `delta' return argument.
can you also clarify the was_loaded return value? I could interpret it as "was
this metaslab loaded by this routine" or "was this metaslab loaded before this
routine was called" (which seems like something the caller could figure out on
its own, but I guess this way is better for the locking?)
also, it's only set when *cursor is at the beginning of this ms.
> }
spa_config_exit(spa, SCL_VDEV, FTAG);
}
+
+/*
+ * Trims all free space in the metaslab. Returns the root TRIM zio (that the
it doesn't trim all free space. It trims starting at *cursor, and trims at
most zfs_max_bytes_per_trim bytes.
> + * On the initial call we memorize if we had to load the metaslab
+ * for ourselves, so we can unload it when we're done.
+ */
+ if (cur == msp->ms_start)
+ *was_loaded = msp->ms_loaded;
+ if (!msp->ms_loaded) {
+ if (metaslab_load(msp) != 0) {
+ /* Load failed, stop trimming this metaslab */
+ *cursor = msp->ms_start + msp->ms_size;
+ mutex_exit(&msp->ms_lock);
+ return (NULL);
+ }
+ }
+
+ /*
+ * Flush out any scheduled extents and add everything in ms_tree
"flush out" could mean "write out", but here you mean "forget about".
> + * Flush out any scheduled extents and add everything in ms_tree
+ * from the last cursor position, but not more than the trim run
+ * limit.
+ */
+ range_tree_vacate(msp->ms_cur_ts, NULL, NULL);
+
+ rsearch.rs_start = cur;
+ rsearch.rs_end = cur + SPA_MINBLOCKSIZE;
+ rs = avl_find(&msp->ms_tree->rt_root, &rsearch, &where);
+ if (rs == NULL) {
+ rs = avl_nearest(&msp->ms_tree->rt_root, where, AVL_AFTER);
+ if (rs != NULL)
+ cur = rs->rs_start;
+ }
+
+ /* Clear out ms_prev_ts, since we'll be trimming everything. */
would probably make more sense to do this above, just after vacating ms_cur_ts?
> + }
+
+ /*
+ * Flush out any scheduled extents and add everything in ms_tree
+ * from the last cursor position, but not more than the trim run
+ * limit.
+ */
+ range_tree_vacate(msp->ms_cur_ts, NULL, NULL);
+
+ rsearch.rs_start = cur;
+ rsearch.rs_end = cur + SPA_MINBLOCKSIZE;
+ rs = avl_find(&msp->ms_tree->rt_root, &rsearch, &where);
+ if (rs == NULL) {
+ rs = avl_nearest(&msp->ms_tree->rt_root, where, AVL_AFTER);
+ if (rs != NULL)
+ cur = rs->rs_start;
I don't think this is necessary, since you set `cur` below on line 3559
> + uint64_t end;
+ if (cur < rs->rs_start)
+ cur = rs->rs_start;
+ end = MIN(cur + (max_bytes - trimmed_space), rs->rs_end);
+ metaslab_trim_add(msp, cur, end - cur);
+ trimmed_space += (end - cur);
+ cur = end;
+ if (cur == rs->rs_end)
+ rs = AVL_NEXT(&msp->ms_tree->rt_root, rs);
+ }
+
+ if (trimmed_space != 0) {
+ /* Force this trim to take place ASAP. */
+ msp->ms_prev_ts = msp->ms_cur_ts;
+ msp->ms_cur_ts = range_tree_create(NULL, NULL, &msp->ms_lock);
+ trim_io = metaslab_exec_trim(msp, B_FALSE);
This seems a little gross to be hijacking ms_cur_ts / ms_prev_ts. Maybe it
would be cleaner to create a local range_tree_t, and pass it to
metaslab_exec_trim(). Although maybe this is necessary to set ms_trimming_ts
and remove the ranges from ms_tree.
> +}
+
+/*
+ * Executes a zio_trim on a range tree holding freed extents in the metaslab.
+ * The set of extents is taken from the metaslab's ms_prev_ts. If there is
+ * another trim currently executing on that metaslab, this function blocks
+ * until that trim completes.
+ * The `auto_trim' argument signals whether the trim is being invoked on
+ * behalf of auto or manual trim. The differences are:
+ * 1) For auto trim the trimset is split up into subtrees, each containing no
+ * more than zfs_max_bytes_per_trim total bytes. Each subtree is then
+ * trimmed in one zio. This is done to limit the number of LBAs per
+ * trim command, as many devices perform suboptimally with large trim
+ * commands, even if they indicate support for them. Manual trim already
+ * applies this limit earlier by limiting the trimset size, so the
+ * whole trimset can be issued in a single zio.
Why do we do it this way, vs. e.g. splitting up the ranges here for both manual
and automatic trim?
> +
+ spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_READER);
+ ASSERT(vd->vdev_ms[0] != NULL);
+ cursor = vd->vdev_ms[0]->ms_start;
+ i = 0;
+ while (i < vti->vti_vdev->vdev_ms_count && !spa->spa_man_trim_stop) {
+ uint64_t delta;
+ metaslab_t *msp = vd->vdev_ms[i];
+ zio_t *trim_io;
+
+ trim_io = metaslab_trim_all(msp, &cursor, &delta, &was_loaded);
+ spa_config_exit(spa, SCL_STATE_ALL, FTAG);
+
+ if (trim_io != NULL) {
+ ASSERT3U(cursor, >=, vd->vdev_ms[0]->ms_start);
+ vd->vdev_trim_prog = cursor - vd->vdev_ms[0]->ms_start;
So vdev_trim_prog is basically the last offset trimmed. Progress with that
could be very jumpy, since some metaslabs might have much more free space than
others. Also it seems strange that the requested trim rate is expressed in
"bytes to trim per second" but progress is reported in "bytes of disk
[allocated or free] per second". Could we track how much free space we started
with, and how much space has been trimmed so far (similar to scrub, but for
free rather than allocated space)?
> + /* delay loop to handle fixed-rate trimming */
+ for (;;) {
+ uint64_t rate = spa->spa_man_trim_rate;
+ uint64_t sleep_delay;
+ clock_t t1;
+
+ if (rate == 0) {
+ /* No delay, just update 't' and move on. */
+ t = ddi_get_lbolt();
+ break;
+ }
+
+ sleep_delay = (delta * hz) / rate;
+ mutex_enter(&spa->spa_man_trim_lock);
+ t1 = cv_timedwait(&spa->spa_man_trim_update_cv,
+ &spa->spa_man_trim_lock, t + sleep_delay);
We'd like to move away from using ticks/lbolt and instead standardize on
hrtime, which has fixed units (nanoseconds). cv_timedwait_hires() takes
nanoseconds to sleep or an absolute hrtime to sleep until.
> + limited = mused > mlim;
+ DTRACE_PROBE3(autotrim__mem__lim, vdev_t *, vd, uint64_t, mused,
+ uint64_t, mlim);
+
+ /*
+ * Since we typically have hundreds of metaslabs per vdev, but we only
+ * trim them once every zfs_txgs_per_trim txgs, it'd be best if we
+ * could sequence the TRIM commands from all metaslabs so that they
+ * don't all always pound the device in the same txg. We do so taking
+ * the txg number modulo txgs_per_trim and then skipping by
+ * txgs_per_trim. Thus, for the default 200 metaslabs and 32
+ * txgs_per_trim, we'll only be trimming ~6.25 metaslabs per txg.
+ */
+ for (uint64_t i = txg % txgs_per_trim; i < vd->vdev_ms_count;
+ i += txgs_per_trim)
+ metaslab_auto_trim(vd->vdev_ms[i], !limited);
it might be easier to understand if you used the same boolean sense for
`limited` and `preserve_spilled`, e.g. use `low_memory` in both vdev_auto_trim
and metaslab_auto_trim.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/172#pullrequestreview-35910452
------------------------------------------
openzfs-developer
Archives: https://openzfs.topicbox.com/groups/developer/
Powered by Topicbox: https://topicbox.com