-----------------------------------------------------------
This is an automatically generated e-mail.  To reply, visit:
https://reviews.csiden.org/r/267/#review877
-----------------------------------------------------------
Hi Sašo,

I just hit an interesting corner case in the l2_rebuild_abort_lowmem logic:

- Import several pools; the rebuild goes fine.
- Import a pool that at system shutdown was dealing with a huge set (half a TiB) of deduped-dataset snapshot destroys.
- That import takes about 15-20 minutes waiting on the spa_namespace_lock mutex (in spa_import), the l2arc_feed_thr_lock mutex, and then the spa_namespace_lock mutex again (in spa_async_thread).
- During this time, other activity makes the ARC grow large.
- The rebuild then sees arc_reclaim_needed(), and so the several GB of DDT (among other things) on the L2 device are effectively discarded.

I think the abort-in-lowmem response is a bit aggressive. In this particular example, this was the first time arc_reclaim_needed() had been true since the system was started. Maybe backing off and trying again once or twice before aborting is a reasonable thing to do in the generic case?
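To make the suggestion concrete, here is a rough sketch of the kind of thing I have in mind. It is only an illustration, not part of the patch: the helper name, the tunable, the retry count and the one-second delay are all made up, and it assumes it would sit in arc.c next to l2arc_rebuild(), which could call it in place of its bare arc_reclaim_needed() check.

```
/*
 * Hypothetical sketch only: rather than aborting the rebuild the first
 * time arc_reclaim_needed() is true, back off briefly and retry a couple
 * of times before giving up.  Names and numbers below are invented for
 * illustration; they are not from the actual patch.
 */
int l2arc_rebuild_lowmem_retries = 2;	/* retries before aborting (arbitrary) */

static boolean_t
l2arc_rebuild_lowmem_abort(void)
{
	int tries;

	for (tries = 0; arc_reclaim_needed(); tries++) {
		if (tries >= l2arc_rebuild_lowmem_retries)
			return (B_TRUE);	/* still short of memory: abort */
		delay(hz);	/* give the reclaim thread a second to catch up */
	}
	return (B_FALSE);	/* pressure has passed: keep rebuilding */
}
```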
Also, as lundman noted, so far nothing else in OpenZFS expects thread_join() to work in the kernel; my approach was to extend struct l2arc_dev to hold a mutex and a condvar,

```
	kmutex_t	l2ad_rebuild_mutex;
	kcondvar_t	l2ad_rebuild_cv;
```

and to do this in l2arc_remove_vdev():

```
	if (remdev->l2ad_rebuild_did != 0) {
		/*
		 * N.B. it should be safe to thread_join with the rebuild
		 * thread while holding l2arc_dev_mtx because it is not
		 * accessed from anywhere in the l2arc rebuild code below
		 * (except for l2arc_spa_rebuild_start, which is ok).
		 */
		mutex_enter(&remdev->l2ad_rebuild_mutex);
		do {
			static int iter = 0;
			dprintf("ZFS: %s: waiting on condvar for rebuild thread %p, iter %d\n",
			    __func__, remdev->l2ad_rebuild_did, ++iter);
			cv_timedwait(&remdev->l2ad_rebuild_cv,
			    &remdev->l2ad_rebuild_mutex,
			    ddi_get_lbolt() + (hz * 5));
		} while (!remdev->l2ad_rebuild_thread_exiting);
		mutex_exit(&remdev->l2ad_rebuild_mutex);
		dprintf("ZFS: %s: thread done %p, destroying cv and mutex, setting did to 0\n",
		    __func__, remdev->l2ad_rebuild_did);
		cv_destroy(&remdev->l2ad_rebuild_cv);
		mutex_destroy(&remdev->l2ad_rebuild_mutex);
		remdev->l2ad_rebuild_did = 0;
		/* thread_join(remdev->l2ad_rebuild_did); */
```

and this in l2arc_spa_rebuild_start():

```
	if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {
		VERIFY3U(dev->l2ad_rebuild_did, ==, 0);
#ifdef _KERNEL
		mutex_init(&dev->l2ad_rebuild_mutex, NULL,
		    MUTEX_DEFAULT, NULL);
		cv_init(&dev->l2ad_rebuild_cv, NULL, CV_DEFAULT, NULL);
		mutex_enter(&dev->l2ad_rebuild_mutex);
		dev->l2ad_rebuild_thread_exiting = FALSE;
		mutex_exit(&dev->l2ad_rebuild_mutex);
		dev->l2ad_rebuild_did = thread_create(NULL, 0,
		    (void *)l2arc_dev_rebuild_start, dev, 0, &p0,
		    TS_RUN, minclsyspri);
#endif
	}
```

and this in l2arc_dev_rebuild_start():

```
	mutex_enter(&dev->l2ad_rebuild_mutex);
	dev->l2ad_rebuild_thread_exiting = FALSE;
	if (!dev->l2ad_rebuild_cancel) {
		mutex_exit(&dev->l2ad_rebuild_mutex);
		VERIFY(dev->l2ad_rebuild);
		(void) l2arc_rebuild(dev);
		mutex_enter(&dev->l2ad_rebuild_mutex);
		dev->l2ad_rebuild = B_FALSE;
		mutex_exit(&dev->l2ad_rebuild_mutex);
	} else {
		mutex_exit(&dev->l2ad_rebuild_mutex);
	}

	mutex_enter(&dev->l2ad_rebuild_mutex);
	dev->l2ad_rebuild_thread_exiting = TRUE;
	mutex_exit(&dev->l2ad_rebuild_mutex);
	cv_broadcast(&dev->l2ad_rebuild_cv);
	kpreempt(KPREEMPT_SYNC);

	if (!dev->l2ad_rebuild_cancel && dev->l2ad_rebuild_did == thr) {
		/* we didn't get cleaned up in l2arc_remove_vdev */
		cv_destroy(&dev->l2ad_rebuild_cv);
		mutex_destroy(&dev->l2ad_rebuild_mutex);
		dev->l2ad_rebuild_did = 0;
	} else if (dev->l2ad_rebuild_did != thr) {
		/*
		 * zero: we probably got cleaned up in l2arc_remove_vdev,
		 * nonzero: impossible?
		 */
		dprintf("ZFS: %s: rebuild did changed under us, thread %p/%p\n",
		    __func__, thr, dev->l2ad_rebuild_did);
	}
	dprintf("ZFS: %s calling thread_exit(), thread %p/%p, l2ad_rebuild_cancel == %d\n",
	    __func__, thr, dev->l2ad_rebuild_did, dev->l2ad_rebuild_cancel);
	thread_exit();
```

Finally, I agree with you, in your discussion with Matt Ahrens, that the ~51 GB lower limit on rebuildable L2 vdevs is a bit strong. Consider the case where a smaller L2ARC is actually desirable because the average block size is very small and turnover is fairly high: the extra l2hdrs can be significant pollution, and would accumulate on a larger L2 device. However, there is presumably still value in holding on to some smaller number of small blocks.

Sean.


- Sean Doran


On Nov. 5, 2015, 3:31 p.m., Saso Kiselkov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail.  To reply, visit:
> https://reviews.csiden.org/r/267/
> -----------------------------------------------------------
> 
> (Updated Nov. 5, 2015, 3:31 p.m.)
> 
> 
> Review request for OpenZFS Developer Mailing List and Christopher Siden.
> 
> 
> Repository: illumos-gate
> 
> 
> Description
> -------
> 
> This is an upstream port of the Persistent L2ARC feature from Nexenta's repo.
> 
> 
> Diffs
> -----
> 
>   usr/src/uts/common/sys/fs/zfs.h bc9f057dd1361ae73a12375515abacd0fed820d2
>   usr/src/uts/common/fs/zfs/vdev_label.c f0924ab1e66eaa678540da8925995da6e0e2a29c
>   usr/src/uts/common/fs/zfs/vdev.c 1c57fce4dcee909b164353181dcd8e2a29ed7946
>   usr/src/uts/common/fs/zfs/sys/spa.h 7ac78390338a44f7b7658017e1ae8fcc9beb89d6
>   usr/src/uts/common/fs/zfs/sys/arc.h 899b72114b9909098080a5d6bbad1a60808f030c
>   usr/src/uts/common/fs/zfs/spa.c 95a6b0fae7760e8a1e8cfc1e657dc22fd9ef3245
>   usr/src/uts/common/fs/zfs/arc.c 52f582be1633a8d462105e068ae9679c04753d24
>   usr/src/lib/libzpool/common/sys/zfs_context.h 9e4d8ed0b8ec42be75bb93f44602ac99e907cf00
> 
> Diff: https://reviews.csiden.org/r/267/diff/
> 
> 
> Testing
> -------
> 
> Been running in Nexenta's QA process for the past 2+ months.
> 
> 
> Thanks,
> 
> Saso Kiselkov
> 