(resend to cc a few more lists -- sorry) I think the overall idea of this change is sound -- and that's a great performance improvement.
Can you change dmu_objset_find_parallel() to work like dmu_objset_find_dp(), and then change zil_check_log_chain(0 and zil_claim() to work off of the dsl_dataset_t*, rather than the char*name? Then I think the other threads won't need to grab the namespace lock. --matt On Thu, Oct 16, 2014 at 12:05 PM, Arne Jansen via illumos-developer < [email protected]> wrote: > In our setup zpool import takes a very long time (about 15 minutes) due > to a large count of filesystems (> 5000). > During this time, you can see the disks at only a few percent busy in > iostat, while the pool is at 100%, which indicates a single-threaded > import process. > What's causing this is checking of the zil chains for each filesystem. > Each filesystem is visisted twice, once to check the chain and once > to claim it. This is done in a sequential manner. > I've created a proof-of-concept patch that switches this process to a > parallel enumeration using a taskq. > In my setup this speeds up the import process by a factor of 20, > bringing all disks (30 in my case) to nearly 100% busy. > There's only one problem: locking. spa_open is called with > spa_namespace_lock held. With this, later on zil_check_log_chain and > zil_claim are called. These call in turn spa_open, which tries to claim > spa_namespace_lock again. This problem is not new. The current solution > looks like this: > > /* > * As disgusting as this is, we need to support recursive calls to this > * function because dsl_dir_open() is called during spa_load(), and ends > * up calling spa_open() again. The real fix is to figure out how to > * avoid dsl_dir_open() calling this in the first place. > */ > if (mutex_owner(&spa_namespace_lock) != curthread) { > mutex_enter(&spa_namespace_lock); > locked = B_TRUE; > } > > This doesn't work anymore when I call zil_claim/check_log_chain through > a taskq and thus from a different thread. My current hacky solution is > to pass a flag through the call chain to spa_open to signal that the > lock isn't needed. It works, but it doesn't look too nice... > > My question now is how to build a cleaner solution. The snippet above > is already ugly in itself, so it would be good to get rid of it at all. > The functions involved in the call chains are all very commonly used > functions. Changing the signature of those will probably give a quite > bulky patch, so I'd prefer to find a nice small unintrusive way. > > The patch currently looks like this: > > http://cr.illumos.org/~webrev/sensille/find_parallel/ > > Hopefully you can give me some hints on how to solve this recursive > locking riddle... > > -Arne > > > > ------------------------------------------- > illumos-developer > Archives: https://www.listbox.com/member/archive/182179/=now > RSS Feed: https://www.listbox.com/member/archive/rss/182179/ > 21175174-cd73734d > Modify Your Subscription: https://www.listbox.com/ > member/?member_id=21175174&id_secret=21175174-792643f6 > Powered by Listbox: http://www.listbox.com >
_______________________________________________ developer mailing list [email protected] http://lists.open-zfs.org/mailman/listinfo/developer
