On Sun, Apr 15, 2018 at 3:34 PM, Al Viro <v...@zeniv.linux.org.uk> wrote: > On Sun, Apr 15, 2018 at 10:54:55PM +0100, Al Viro wrote: >> On Sun, Apr 15, 2018 at 09:40:54PM +0100, Al Viro wrote: >> >> > BTW, the current placement of cond_resched() looks bogus; suppose we >> > have collected a lot of victims and ran into need_resched(). We leave >> > d_walk() and call shrink_dentry_list(). At that point there's a lot >> > of stuff on our shrink list and anybody else running into them will >> > have to keep scanning. Giving up the timeslice before we take care >> > of any of those looks like a bad idea, to put it mildly, and that's >> > precisely what will happen. >> > >> > What about doing that in the end of __dentry_kill() instead? And to >> > hell with both existing call sites - dput() one (before going to >> > the parent) is obviously covered by that (dentry_kill() only returns >> > non-NULL after having called __dentry_kill()) and in shrink_dentry_list() >> > we'll get to it as soon as we go through all dentries that can be >> > immediately kicked off the shrink list. Which, AFAICS, improves the >> > situation, now that shrink_lock_dentry() contains no trylock loops... >> > >> > Comments? >> >> What I mean is something like this (cumulative diff, it'll obviously need >> to be carved up into 3--4 commits): > > ... and carved-up version is in vfs.git#work.dcache. Could syzbot folks > hit it with their reproducers?
To me it seems like shrink_dcache_parent could still spin without scheduling similar to before - found > 0 since *someone* is shrinking, but we have 0 entries to shrink - we never call __dentry_kill or schedule, and just spin. And indeed, the syzbot reproducer @vfs#work.dcache hits a softlockup in shrink_dcache_parent. I'd think re-adding cond_resched to shrink_dcache_parent does make the softlockup go way. It seems to fix the reproducer. Although did we want to terminate the loop in shrink_dcache_parent earlier?
Description: S/MIME Cryptographic Signature