-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/267/#review877
-----------------------------------------------------------



Hi Sašo,

    I just hit an interesting corner case in the l2_rebuild_abort_lowmem
logic.
 
- Import several pools; rebuild goes fine.
- Import a pool that at system shutdown was dealing with a huge set (half a 
TiB) of deduped-dataset snapshot destroys.
- This takes about 15-20 min waiting on the spa_namespace_lock mutex (in 
spa_import), the l2arc_feed_thr_lock mutex, and then the spa_namespace_lock 
mutex again (in spa_async_thread).
- During this time, other activity makes the ARC grow large.
- The rebuild then sees arc_reclaim_needed() return true, and so the several 
GB of DDT (among other things) on the L2 device are effectively discarded.

   I think that the abort-in-lowmem response is a bit aggressive. In this 
particular example, it was the first time arc_reclaim_needed() had been true 
since the system was started.

   Maybe backing off and trying again once or twice before aborting is a 
reasonable thing to do in the generic case? 
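Something along these lines, purely as a sketch; the loop shape, the 
lowmem_hits counter, and the l2arc_rebuild_lowmem_retries tunable are all 
invented here for illustration, not taken from the patch:

```
        /*
         * Sketch only: tolerate a few arc_reclaim_needed() hits before
         * aborting the rebuild.  "lowmem_hits", "have_more_log_blocks",
         * and "l2arc_rebuild_lowmem_retries" are invented names.
         */
        int lowmem_hits = 0;

        while (have_more_log_blocks) {
                if (arc_reclaim_needed()) {
                        if (++lowmem_hits > l2arc_rebuild_lowmem_retries) {
                                err = SET_ERROR(ENOMEM); /* now give up */
                                break;
                        }
                        delay(hz);      /* back off ~1 s, then re-check */
                        continue;
                }
                /* ... restore the next log block as before ... */
        }
```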

   Also, as lundman noted, nobody else in OpenZFS expects thread_join() to 
work in the kernel; my approach was to extend struct l2arc_dev to hold a 
mutex and a condvar:

```
        kmutex_t                l2ad_rebuild_mutex;
        kcondvar_t              l2ad_rebuild_cv;
```

and to do this in l2arc_remove_vdev():

```
        if (remdev->l2ad_rebuild_did != 0) {
                /*
                 * N.B. it should be safe to wait for the rebuild thread
                 * while holding l2arc_dev_mtx because it is not accessed
                 * from anywhere in the l2arc rebuild code below (except
                 * for l2arc_spa_rebuild_start, which is ok).
                 */
                mutex_enter(&remdev->l2ad_rebuild_mutex);
                do {
                        static int iter = 0;
                        dprintf("ZFS: %s: waiting on condvar for rebuild "
                            "thread %p, iter %d\n", __func__,
                            remdev->l2ad_rebuild_did, ++iter);
                        /* wake at least every 5 s to re-check the flag */
                        cv_timedwait(&remdev->l2ad_rebuild_cv,
                            &remdev->l2ad_rebuild_mutex,
                            ddi_get_lbolt() + (hz * 5));
                } while (!remdev->l2ad_rebuild_thread_exiting);
                mutex_exit(&remdev->l2ad_rebuild_mutex);
                dprintf("ZFS: %s: thread done %p, destroying cv and mutex, "
                    "setting did to 0\n", __func__,
                    remdev->l2ad_rebuild_did);
                cv_destroy(&remdev->l2ad_rebuild_cv);
                mutex_destroy(&remdev->l2ad_rebuild_mutex);
                remdev->l2ad_rebuild_did = 0;
                /* thread_join(remdev->l2ad_rebuild_did); */
        }
```
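The 5-second timeout on cv_timedwait() is the safety net here: even if the 
broadcast from the rebuild thread is somehow missed, the loop still wakes up 
and re-checks l2ad_rebuild_thread_exiting, so removal can be delayed a few 
seconds but cannot hang.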

and this in l2arc_spa_rebuild_start():

```
                if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {
                        VERIFY3U(dev->l2ad_rebuild_did, ==, 0);
#ifdef  _KERNEL
                        mutex_init(&dev->l2ad_rebuild_mutex, NULL,
                            MUTEX_DEFAULT, NULL);
                        cv_init(&dev->l2ad_rebuild_cv, NULL, CV_DEFAULT,
                            NULL);
                        mutex_enter(&dev->l2ad_rebuild_mutex);
                        dev->l2ad_rebuild_thread_exiting = FALSE;
                        mutex_exit(&dev->l2ad_rebuild_mutex);
                        dev->l2ad_rebuild_did = thread_create(NULL, 0,
                            (void (*)())l2arc_dev_rebuild_start, dev, 0,
                            &p0, TS_RUN, minclsyspri);
#endif
                }
```
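Note the ordering: the mutex and cv are initialized before thread_create(), 
so the rebuild thread can never find them uninitialized, and the VERIFY3U() 
catches any attempt to start a second rebuild thread while l2ad_rebuild_did 
is still live.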

 
and this in l2arc_dev_rebuild_start():

```
        /*
         * "thr" is this thread's own id (presumably curthread), captured
         * earlier in the function and not shown in this excerpt.
         */
        mutex_enter(&dev->l2ad_rebuild_mutex);
        dev->l2ad_rebuild_thread_exiting = FALSE;
        if (!dev->l2ad_rebuild_cancel) {
                mutex_exit(&dev->l2ad_rebuild_mutex);
                VERIFY(dev->l2ad_rebuild);
                (void) l2arc_rebuild(dev);
                mutex_enter(&dev->l2ad_rebuild_mutex);
                dev->l2ad_rebuild = B_FALSE;
                mutex_exit(&dev->l2ad_rebuild_mutex);
        } else {
                mutex_exit(&dev->l2ad_rebuild_mutex);
        }
        mutex_enter(&dev->l2ad_rebuild_mutex);
        dev->l2ad_rebuild_thread_exiting = TRUE;
        mutex_exit(&dev->l2ad_rebuild_mutex);
        cv_broadcast(&dev->l2ad_rebuild_cv);
        kpreempt(KPREEMPT_SYNC);
        if (!dev->l2ad_rebuild_cancel && dev->l2ad_rebuild_did == thr) {
                /* we didn't get cleaned up in l2arc_remove_vdev */
                cv_destroy(&dev->l2ad_rebuild_cv);
                mutex_destroy(&dev->l2ad_rebuild_mutex);
                dev->l2ad_rebuild_did = 0;
        } else if (dev->l2ad_rebuild_did != thr) {
                /*
                 * Zero: we probably got cleaned up in l2arc_remove_vdev;
                 * nonzero: impossible?
                 */
                dprintf("ZFS: %s: did is no longer us, thread %p/%p\n",
                    __func__, thr, dev->l2ad_rebuild_did);
        }
        dprintf("ZFS: %s calling thread_exit(), thread %p/%p, "
            "l2ad_rebuild_cancel == %d\n", __func__, thr,
            dev->l2ad_rebuild_did, dev->l2ad_rebuild_cancel);
        thread_exit();
```
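The did field doubles as the handshake token: if l2arc_remove_vdev() reaped 
the thread first, it zeroes l2ad_rebuild_did and destroys the cv and mutex 
there; otherwise did still holds this thread's id and the thread tears them 
down itself before exiting.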

Finally, I agree with you in your discussion with Matt Ahrens that the ~51 GB 
lower limit on rebuildable L2 vdevs is a bit strong.

Consider the case where a smaller L2ARC is actually desirable because the 
average block size is very small and turnover is fairly high; the in-core 
l2hdrs can be significant memory pollution, and a larger L2 device only 
accumulates more of them. However, there is still presumably value in 
holding on to some smaller number of small blocks.

    Sean.

- Sean Doran


On Nov. 5, 2015, 3:31 p.m., Saso Kiselkov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.csiden.org/r/267/
> -----------------------------------------------------------
> 
> (Updated Nov. 5, 2015, 3:31 p.m.)
> 
> 
> Review request for OpenZFS Developer Mailing List and Christopher Siden.
> 
> 
> Repository: illumos-gate
> 
> 
> Description
> -------
> 
> This is an upstream port of the Persistent L2ARC feature from Nexenta's repo.
> 
> 
> Diffs
> -----
> 
>   usr/src/uts/common/sys/fs/zfs.h bc9f057dd1361ae73a12375515abacd0fed820d2 
>   usr/src/uts/common/fs/zfs/vdev_label.c 
> f0924ab1e66eaa678540da8925995da6e0e2a29c 
>   usr/src/uts/common/fs/zfs/vdev.c 1c57fce4dcee909b164353181dcd8e2a29ed7946 
>   usr/src/uts/common/fs/zfs/sys/spa.h 
> 7ac78390338a44f7b7658017e1ae8fcc9beb89d6 
>   usr/src/uts/common/fs/zfs/sys/arc.h 
> 899b72114b9909098080a5d6bbad1a60808f030c 
>   usr/src/uts/common/fs/zfs/spa.c 95a6b0fae7760e8a1e8cfc1e657dc22fd9ef3245 
>   usr/src/uts/common/fs/zfs/arc.c 52f582be1633a8d462105e068ae9679c04753d24 
>   usr/src/lib/libzpool/common/sys/zfs_context.h 
> 9e4d8ed0b8ec42be75bb93f44602ac99e907cf00 
> 
> Diff: https://reviews.csiden.org/r/267/diff/
> 
> 
> Testing
> -------
> 
> Been running in Nexenta's QA process for the past 2+ months.
> 
> 
> Thanks,
> 
> Saso Kiselkov
> 
>
