Re: zpool import skips wedges due to a race condition

2021-09-11 Thread Taylor R Campbell
> Date: Tue, 7 Sep 2021 22:30:45 +0100
> From: Alexander Nasonov 
> 
> When I run zfs import, it launches 32 threads and opens 32 disks in
> parallel, including cgd1 and dk24. But it can't open dk24 while
> cgd1 is still open (it fails with EBUSY).

I have the attached patch in my tree from a while ago to scan
non-wedge disks first, then wedges.  For some reason I didn't commit
it but I forget why.  (Well, it's a little kludgey to look at the
pathname to distinguish dkN devices, but...)
From dbee261ca7b5837946e93ec601acb04ae0b0a8b1 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell 
Date: Sat, 29 Feb 2020 18:19:45 +0000
Subject: [PATCH] Scan primary disks first; then dkwedges.

---
 .../dist/lib/libzfs/common/libzfs_import.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/external/cddl/osnet/dist/lib/libzfs/common/libzfs_import.c b/external/cddl/osnet/dist/lib/libzfs/common/libzfs_import.c
index b935e49076a5..e8d29b86f4b2 100644
--- a/external/cddl/osnet/dist/lib/libzfs/common/libzfs_import.c
+++ b/external/cddl/osnet/dist/lib/libzfs/common/libzfs_import.c
@@ -1175,6 +1175,9 @@ zpool_find_import_impl(libzfs_handle_t *hdl, importargs_t *iarg)
config_entry_t *ce, *cenext;
name_entry_t *ne, *nenext;
avl_tree_t slice_cache;
+#ifdef __NetBSD__
+   avl_tree_t dkwedge_cache;
+#endif
rdsk_node_t *slice;
void *cookie;
 
@@ -1266,6 +1269,8 @@ zpool_find_import_impl(libzfs_handle_t *hdl, importargs_t *iarg)
}
 #endif
 #ifdef __NetBSD__
+   avl_create(&dkwedge_cache, slice_cache_compare,
+   sizeof (rdsk_node_t), offsetof(rdsk_node_t, rn_node));
if (strcmp(rdsk, "/dev/") == 0) {
static const char mib_name[] = "hw.disknames";
size_t len;
@@ -1291,7 +1296,10 @@ zpool_find_import_impl(libzfs_handle_t *hdl, importargs_t *iarg)
slice->rn_dfd = dfd;
slice->rn_hdl = hdl;
slice->rn_nozpool = B_FALSE;
-   avl_add(&slice_cache, slice);
+   if (strncmp(name, "dk", 2) == 0)
+   avl_add(&dkwedge_cache, slice);
+   else
+   avl_add(&slice_cache, slice);
}
free(disknames);
 
@@ -1333,6 +1341,16 @@ skipdir:
AVL_AFTER)))
(void) tpool_dispatch(t, zpool_open_func, slice);
tpool_wait(t);
+#ifdef __NetBSD__
+   for (slice = avl_first(&dkwedge_cache); slice;
+   (slice = avl_walk(&dkwedge_cache, slice,
+   AVL_AFTER)))
+   (void) tpool_dispatch(t, zpool_open_func, slice);
+   tpool_wait(t);
+   while ((slice = avl_destroy_nodes(&dkwedge_cache,
+   &cookie)) != NULL)
+   avl_add(&slice_cache, slice);
+#endif
tpool_destroy(t);
 
cookie = NULL;
@@ -1373,6 +1391,9 @@ skipdir:
free(slice->rn_name);
free(slice);
}
+#ifdef __NetBSD__
+   avl_destroy(&dkwedge_cache);
+#endif
avl_destroy(&slice_cache);
 
(void) closedir(dirp);


Re: zpool import skips wedges due to a race condition

2021-09-08 Thread Martin Husemann
On Wed, Sep 08, 2021 at 08:51:15AM +0100, Patrick Welche wrote:
> I see something similar in another context: when I shut down, shutdown
> can stall on a system with dkX on cgdN, with
> 
> detaching dkX
> detaching cgdN
> detaching dkX (same X)
> (hang)
> 
> If the dice are rolled correctly, and cgdN gets the detach before dkX,
> it shuts down properly...

No, this is something different. The kernel knows all the dependencies
well. This is a plain bug somewhere(tm).

Martin


Re: zpool import skips wedges due to a race condition

2021-09-08 Thread Martin Husemann
On Wed, Sep 08, 2021 at 07:57:35AM +0100, Alexander Nasonov wrote:
> The order doesn't have to be arbitrary; the kernel can be fixed
> to return disk names in a well-defined order.

You can do that today with a custom kernel and lots of hard-wired
disk attachments (wd0 ... wdN instead of wd*), but it makes no sense
for GENERIC.

I'm not sure how you would change the probe order later via a sysctl;
it sounds easier to fix the zfs code instead and have it not probe
random devices for zfs but use specifically configured ones.
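
As a rough illustration of that direction, here is a minimal sketch of an
LVM-style allow-list filter that could be consulted before a candidate name
is added to the scan list. The pattern table and the device_allowed() helper
are hypothetical, not an existing libzfs or NetBSD interface:

/*
 * Hypothetical sketch only: an allow-list filter in the spirit of LVM's
 * device filter, consulted before a candidate name is added to the scan
 * list.  The pattern table and device_allowed() are illustrations, not
 * an existing libzfs interface.
 */
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

static const char *scan_patterns[] = {
        "wd[0-9]*",     /* example patterns; a real filter would be configurable */
        "cgd[0-9]*",
        "dk[0-9]*",
        NULL
};

static bool
device_allowed(const char *name)
{
        for (size_t i = 0; scan_patterns[i] != NULL; i++) {
                if (fnmatch(scan_patterns[i], name, 0) == 0)
                        return true;    /* probe devices matching a pattern */
        }
        return false;                   /* skip everything else */
}

In zpool_find_import_impl() such a check could sit right before each
avl_add() of a candidate device, so only explicitly configured devices
would ever be opened during the scan.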

Martin


Re: zpool import skips wedges due to a race condition

2021-09-08 Thread Patrick Welche
On Wed, Sep 08, 2021 at 06:38:02AM -0000, Michael van Elst wrote:
> al...@yandex.ru (Alexander Nasonov) writes:
> 
> >When I run zfs import, it launches 32 threads and opens 32 disks in
> >parallel, including cgd1 and dk24. But it can't open dk24 while
> >cgd1 is still open (it fails with EBUSY).
> 
> >I fixed it in the attached patch by running only one thread. It's
> >not the best approach but I'm not sure how to fix it properly.
> 
> 
> There are other issues with scanning devices in an arbitrary order,
> a parallel scan just makes it worse by adding randomness.
> 
> LVM tries to solve this with an optional filter for device names
> when scanning for physical volumes.
> 
> The root detection code tries to solve this by scanning twice, once
> for wedges, once for everything else, and by identifying wedges
> that alias a partition.
> 
> For a complete solution you would need to know all the device relationships
> (dkX on wdY, cgdN on dmN, etc., but also e.g. dkX on cgdN). That still leaves
> out hot-plug devices where "upper" devices appear late.

I see something similar in another context: when I shut down, shutdown
can stall on a system with dkX on cgdN, with

detaching dkX
detaching cgdN
detaching dkX (same X)
(hang)

If the dice are rolled correctly, and cgdN gets the detach before dkX,
it shuts down properly...


Cheers,

Patrick


Re: zpool import skips wedges due to a race condition

2021-09-08 Thread Alexander Nasonov
Michael van Elst wrote:
> There are other issues with scanning devices in an arbitrary order,
> a parallel scan just makes it worse by adding randomness.

The order doesn't have to be arbitrary; the kernel can be fixed
to return disk names in a well-defined order. A new zfs.disknames
sysctl can be added to serve this specific purpose.

But I agree that doesn't solve the problem completely.
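
For illustration only, a userland sketch of imposing such an order on the
list that the existing hw.disknames sysctl already returns; the proposed
zfs.disknames sysctl does not exist, and the sort below merely pushes dkN
wedges to the end:

/*
 * Sketch only: read hw.disknames and sort the names so that dkN wedges
 * come last.  A plain strcmp() keeps the rest of the order stable; a
 * real version would probably sort numerically as well.
 */
#include <sys/param.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
wedges_last(const void *a, const void *b)
{
        const char *na = *(const char *const *)a;
        const char *nb = *(const char *const *)b;
        int wa = (strncmp(na, "dk", 2) == 0);
        int wb = (strncmp(nb, "dk", 2) == 0);

        if (wa != wb)
                return wa - wb;         /* non-wedges sort before wedges */
        return strcmp(na, nb);
}

int
main(void)
{
        const char *list[256];
        char *names, *p, *last;
        size_t len, n = 0;

        if (sysctlbyname("hw.disknames", NULL, &len, NULL, 0) == -1)
                return 1;
        if ((names = malloc(len)) == NULL)
                return 1;
        if (sysctlbyname("hw.disknames", names, &len, NULL, 0) == -1)
                return 1;

        for (p = strtok_r(names, " ", &last); p != NULL && n < 256;
            p = strtok_r(NULL, " ", &last))
                list[n++] = p;
        qsort(list, n, sizeof(list[0]), wedges_last);

        for (size_t i = 0; i < n; i++)  /* wedges (dkN) now come last */
                printf("%s\n", list[i]);
        free(names);
        return 0;
}

The two-pass patch at the top of this page gets the same effect inside
libzfs itself, by keeping wedges in a separate AVL tree and dispatching
them only after the first tpool_wait().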

Alex


Re: zpool import skips wedges due to a race condition

2021-09-08 Thread Michael van Elst
al...@yandex.ru (Alexander Nasonov) writes:

>When I run zfs import, it launches 32 threads and opens 32 disks in
>parallel, including cgd1 and dk24. But it can't open dk24 while
>cgd1 is still open (it fails with EBUSY).

>I fixed it in the attached patch by running only one thread. It's
>not the best approach but I'm not sure how to fix it properly.


There are other issues with scanning devices in an arbitrary order,
a parallel scan just makes it worse by adding randomness.

LVM tries to solve this with an optional filter for device names
when scanning for physical volumes.

The root detection code tries to solve this by scanning twice, once
for wedges, once for everything else, and by identifying wedges
that alias a partition.

For a complete solution you would need to know all the device relationships
(dkX on wdY, cgdN on dmN, etc., but also e.g. dkX on cgdN). That still leaves
out hot-plug devices where "upper" devices appear late.
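
For the dkX-on-parent part of those relationships there is at least an
existing query, since a wedge knows its own parent. A minimal sketch using
NetBSD's DIOCGWEDGEINFO ioctl (the device path is an example taken from this
thread, and error handling is kept to a minimum):

/*
 * Sketch only: ask a wedge which disk it sits on via DIOCGWEDGEINFO.
 */
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/disk.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        struct dkwedge_info dkw;
        int fd;

        fd = open("/dev/rdk24", O_RDONLY);      /* example wedge from this thread */
        if (fd == -1) {
                perror("open");
                return 1;
        }
        if (ioctl(fd, DIOCGWEDGEINFO, &dkw) == -1) {
                perror("DIOCGWEDGEINFO");
                close(fd);
                return 1;
        }
        /* dkw_parent names the disk the wedge sits on, e.g. "cgd1" */
        printf("%s on %s, type %s\n", dkw.dkw_devname, dkw.dkw_parent,
            dkw.dkw_ptype);
        close(fd);
        return 0;
}

Assembling the full graph (cgd, dm and other layers, plus hot-plugged
devices that appear late) is the harder part described above.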



zpool import skips wedges due to a race condition

2021-09-07 Thread Alexander Nasonov
(forwarding to current-users because tech-misc appears to be inactive)

- Forwarded message from Alexander Nasonov  -

Date: Sun, 5 Sep 2021 22:16:48 +0100
From: Alexander Nasonov 
To: tech-m...@netbsd.org
Subject: zpool import skips wedges due to a race condition

zfs import reliably fails to detect a pool on my server. The pool lives
on cgd1:

# dkctl cgd1 listwedges
/dev/rcgd1: 1 wedge:
dk24: zcgdroot, 6688832954 blocks at 34, type: zfs

When I run zfs import, it launches 32 threads and opens 32 disks in
parallel, including cgd1 and dk24. But it can't open dk24 while
cgd1 is still open (it fails with EBUSY).
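
Stripped of the thread pool, the two competing opens boil down to roughly
the following (a sketch only: the device names come from the listwedges
output above, and the whole-disk partition letter 'd' is the x86 convention,
which differs on other ports):

/*
 * Sketch only: the two competing opens from the parallel scan, reduced
 * to a single sequence.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        int parent, wedge;

        /* One scan thread still holds the parent disk open... */
        parent = open("/dev/rcgd1d", O_RDONLY);
        if (parent == -1) {
                perror("open /dev/rcgd1d");
                return 1;
        }

        /* ...while another thread tries the wedge sitting on it. */
        wedge = open("/dev/rdk24", O_RDONLY);
        if (wedge == -1)
                printf("open /dev/rdk24: %s\n", strerror(errno));
        else
                close(wedge);

        close(parent);
        return 0;
}

In the scenario above it is the second open that fails with EBUSY; the scan
only avoids it if the parent's descriptor is already closed by the time the
wedge is opened.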

I fixed it in the attached patch by running only one thread. It's
not the best approach but I'm not sure how to fix it properly.

Alex

Index: libzfs_import.c
===================================================================
RCS file: /cvsroot/src/external/cddl/osnet/dist/lib/libzfs/common/libzfs_import.c,v
retrieving revision 1.7
diff -p -u -u -r1.7 libzfs_import.c
--- libzfs_import.c 28 Aug 2021 10:47:45 -0000  1.7
+++ libzfs_import.c 5 Sep 2021 20:50:35 -0000
@@ -1326,9 +1326,11 @@ skipdir:
 * double the number of processors; we hold a lot of
 * locks in the kernel, so going beyond this doesn't
 * buy us much.
+* XXX It's not a very smart idea to open all disks in
+* parallel because wedges on NetBSD can't be open while
+* a parent disk is open. For now, only run one thread.
 */
-   t = tpool_create(1, 2 * sysconf(_SC_NPROCESSORS_ONLN),
-   0, NULL);
+   t = tpool_create(1, 1, 0, NULL);
for (slice = avl_first(&slice_cache); slice;
(slice = avl_walk(&slice_cache, slice,
AVL_AFTER)))


- End forwarded message -