Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
On Fri, Oct 30, 2020 at 10:38:06AM +0100, Peter Zijlstra wrote: > > if (!chain_head && ret != 2) { > > if (!check_prevs_add(curr, hlock)) > > I'm not entirely sure that doesn't still trigger the problem. Consider > @chain_head := true. Urgh, clearly my brain really isn't working well. Ignore that.
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
On Fri, Oct 30, 2020 at 11:51:18AM +0800, Boqun Feng wrote: > On Wed, Oct 28, 2020 at 08:59:10PM +0100, Peter Zijlstra wrote: > Sorry for the late response. No worries, glad you could have a look. > > So that's commit f611e8cf98ec ("lockdep: Take read/write status in > > consideration when generate chainkey") that did that. > > > > Yeah, I think that's related, however ... It's the commit that made the chainkey depend on the read state, and thus introduced this conundrum. > > So validate_chain() requires the new chain_key, but can change ->read > > which then invalidates the chain_key we just calculated. > > > > This happens when check_deadlock() returns 2, which only happens when: > > > > - next->read == 2 && ... ; however @next is our @hlock, so that's > > pointless > > > > I don't think we should return 2 (earlier) in this case anymore. Because > now we have recursive read deadlock detection, it's safe to add dep: > "prev -> next" in the dependency graph. I think we can just continue in > this case. Actually I think this is something I'm missing in my > recursive read detection patchset :-/ Yes, I agree, this case should go. We now fully support recursive read dependencies per your recent work. > > - when there's a nest_lock involved ; ww_mutex uses that !!! > > > > That leaves check_deadlock() return 2 only if hlock is a nest_lock, and > ... > > @@ -3597,8 +3598,12 @@ static int validate_chain(struct task_struct *curr, > > * building dependencies (just like we jump over > > * trylock entries): > > */ > > - if (ret == 2) > > + if (ret == 2) { > > hlock->read = 2; > > + *chain_key = iterate_chain_key(hlock->prev_chain_key, > > hlock_id(hlock)); > > If "ret == 2" means hlock is a nest_lock, then we don't need the > "->read = 2" trick here and we don't need to update chain_key either. > We used to have this "->read = 2" only because we want to skip the > dependency adding step afterwards. So how about the following: > > It survived a lockdep selftest at boot time.
Right, but our self-tests didn't trigger this problem to begin with, let me go try and create one that does. > ->8 > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > index 3e99dfef8408..b23ca6196561 100644 > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -2765,7 +2765,7 @@ print_deadlock_bug(struct task_struct *curr, struct > held_lock *prev, > * (Note that this has to be done separately, because the graph cannot > * detect such classes of deadlocks.) > * > - * Returns: 0 on deadlock detected, 1 on OK, 2 on recursive read > + * Returns: 0 on deadlock detected, 1 on OK, 2 on nest_lock > */ > static int > check_deadlock(struct task_struct *curr, struct held_lock *next) > @@ -2788,7 +2788,7 @@ check_deadlock(struct task_struct *curr, struct > held_lock *next) > * lock class (i.e. read_lock(lock)+read_lock(lock)): > */ > if ((next->read == 2) && prev->read) > - return 2; > + continue; > > /* > * We're holding the nest_lock, which serializes this lock's > @@ -3592,16 +3592,9 @@ static int validate_chain(struct task_struct *curr, > > if (!ret) > return 0; > - /* > - * Mark recursive read, as we jump over it when > - * building dependencies (just like we jump over > - * trylock entries): > - */ > - if (ret == 2) > - hlock->read = 2; > /* > * Add dependency only if this lock is not the head > - * of the chain, and if it's not a secondary read-lock: > + * of the chain, and if it's not a nest_lock: > */ > if (!chain_head && ret != 2) { > if (!check_prevs_add(curr, hlock)) I'm not entirely sure that doesn't still trigger the problem. Consider @chain_head := true. Anyway, let me go try and write this self-test, maybe that'll get my snot-addled brains sufficiently aligned to make sense of all this :/
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
Hi Peter, On Wed, Oct 28, 2020 at 08:59:10PM +0100, Peter Zijlstra wrote: > On Wed, Oct 28, 2020 at 08:42:09PM +0100, Peter Zijlstra wrote: > > On Wed, Oct 28, 2020 at 05:40:48PM +, Chris Wilson wrote: > > > Quoting Chris Wilson (2020-10-27 16:34:53) > > > > Quoting Peter Zijlstra (2020-10-27 15:45:33) > > > > > On Tue, Oct 27, 2020 at 01:29:10PM +, Chris Wilson wrote: > > > > > > > > > > > <4> [304.908891] hm#2, depth: 6 [6], 3425cfea6ff31f7f != > > > > > > 547d92e9ec2ab9af > > > > > > <4> [304.908897] WARNING: CPU: 0 PID: 5658 at > > > > > > kernel/locking/lockdep.c:3679 check_chain_key+0x1a4/0x1f0 > > > > > > > > > > Urgh, I don't think I've _ever_ seen that warning trigger. > > > > > > > > > > The comments that go with it suggest memory corruption is the most > > > > > likely trigger of it. Is it easy to trigger? > > > > > > > > For the automated CI, yes, the few machines that run that particular HW > > > > test seem to hit it regularly. I have not yet reproduced it for myself. > > > > I thought it looked like something kasan would provide some insight for > > > > and we should get a kasan run through CI over the w/e. I suspect we've > > > > feed in some garbage and called it a lock. > > > > > > I tracked it down to a second invocation of > > > lock_acquire_shared_recursive() > > > intermingled with some other regular mutexes (in this case ww_mutex). > > > > > > We hit this path in validate_chain(): > > > /* > > >* Mark recursive read, as we jump over it when > > >* building dependencies (just like we jump over > > >* trylock entries): > > >*/ > > > if (ret == 2) > > > hlock->read = 2; > > > > > > and that is modifying hlock_id() and so the chain-key, after it has > > > already been computed. > > > > Ooh, interesting.. I'll have to go look at this in the morning, brain is > > fried already. Thanks for digging into it. > Sorry for the late response. 
> So that's commit f611e8cf98ec ("lockdep: Take read/write status in > consideration when generate chainkey") that did that. > Yeah, I think that's related, however ... > So validate_chain() requires the new chain_key, but can change ->read > which then invalidates the chain_key we just calculated. > > This happens when check_deadlock() returns 2, which only happens when: > > - next->read == 2 && ... ; however @next is our @hlock, so that's > pointless > I don't think we should return 2 (earlier) in this case anymore. Because now we have recursive read deadlock detection, it's safe to add dep: "prev -> next" in the dependency graph. I think we can just continue in this case. Actually I think this is something I'm missing in my recursive read detection patchset :-/ > - when there's a nest_lock involved ; ww_mutex uses that !!! > That leaves check_deadlock() return 2 only if hlock is a nest_lock, and ... > I suppose something like the below _might_ just do it, but I haven't > compiled it, and like said, my brain is fried.
> > Boqun, could you have a look, you're a few timezones ahead of us so your > morning is earlier ;-) > > --- > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > index 3e99dfef8408..3caf63532bc2 100644 > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -3556,7 +3556,7 @@ static inline int lookup_chain_cache_add(struct > task_struct *curr, > > static int validate_chain(struct task_struct *curr, > struct held_lock *hlock, > - int chain_head, u64 chain_key) > + int chain_head, u64 *chain_key) > { > /* > * Trylock needs to maintain the stack of held locks, but it > @@ -3568,6 +3568,7 @@ static int validate_chain(struct task_struct *curr, > * (If lookup_chain_cache_add() return with 1 it acquires > * graph_lock for us) > */ > +again: > if (!hlock->trylock && hlock->check && > lookup_chain_cache_add(curr, hlock, chain_key)) { > /* > @@ -3597,8 +3598,12 @@ static int validate_chain(struct task_struct *curr, > * building dependencies (just like we jump over > * trylock entries): > */ > - if (ret == 2) > + if (ret == 2) { > hlock->read = 2; > + *chain_key = iterate_chain_key(hlock->prev_chain_key, > hlock_id(hlock)); If "ret == 2" means hlock is a nest_lock, then we don't need the "->read = 2" trick here and we don't need to update chain_key either. We used to have this "->read = 2" only because we want to skip the dependency adding step afterwards. So how about the following: It survived a lockdep selftest at boot time. Regards, Boqun ->8 diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 3e99dfef8408..b23ca6196561 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -2765,7 +2765,7 @@ print_deadlock_bug(struct task_struct
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
Quoting Chris Wilson (2020-10-28 17:40:48) > Quoting Chris Wilson (2020-10-27 16:34:53) > > Quoting Peter Zijlstra (2020-10-27 15:45:33) > > > On Tue, Oct 27, 2020 at 01:29:10PM +, Chris Wilson wrote: > > > > > > > <4> [304.908891] hm#2, depth: 6 [6], 3425cfea6ff31f7f != > > > > 547d92e9ec2ab9af > > > > <4> [304.908897] WARNING: CPU: 0 PID: 5658 at > > > > kernel/locking/lockdep.c:3679 check_chain_key+0x1a4/0x1f0 > > > > > > Urgh, I don't think I've _ever_ seen that warning trigger. > > > > > > The comments that go with it suggest memory corruption is the most > > > likely trigger of it. Is it easy to trigger? > > > > For the automated CI, yes, the few machines that run that particular HW > > test seem to hit it regularly. I have not yet reproduced it for myself. > > I thought it looked like something kasan would provide some insight for > > and we should get a kasan run through CI over the w/e. I suspect we've > > feed in some garbage and called it a lock. > > I tracked it down to a second invocation of lock_acquire_shared_recursive() > intermingled with some other regular mutexes (in this case ww_mutex). > > We hit this path in validate_chain(): > /* > * Mark recursive read, as we jump over it when > * building dependencies (just like we jump over > * trylock entries): > */ > if (ret == 2) > hlock->read = 2; > > and that is modifying hlock_id() and so the chain-key, after it has > already been computed. 
> > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > index 035f81b1cc87..f193f756e1e3 100644 > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -4831,7 +4831,7 @@ static int __lock_acquire(struct lockdep_map *lock, > unsigned int subclass, > if (!validate_chain(curr, hlock, chain_head, chain_key)) > return 0; > > - curr->curr_chain_key = chain_key; > + curr->curr_chain_key = iterate_chain_key(chain_key, hlock_id(hlock)); > curr->lockdep_depth++; > check_chain_key(curr); No, chain_key should not be chained onto itself again, but I hope it makes the issue clear nevertheless. -Chris
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
Quoting Chris Wilson (2020-10-27 16:34:53) > Quoting Peter Zijlstra (2020-10-27 15:45:33) > > On Tue, Oct 27, 2020 at 01:29:10PM +, Chris Wilson wrote: > > > > > <4> [304.908891] hm#2, depth: 6 [6], 3425cfea6ff31f7f != 547d92e9ec2ab9af > > > <4> [304.908897] WARNING: CPU: 0 PID: 5658 at > > > kernel/locking/lockdep.c:3679 check_chain_key+0x1a4/0x1f0 > > > > Urgh, I don't think I've _ever_ seen that warning trigger. > > > > The comments that go with it suggest memory corruption is the most > > likely trigger of it. Is it easy to trigger? > > For the automated CI, yes, the few machines that run that particular HW > test seem to hit it regularly. I have not yet reproduced it for myself. > I thought it looked like something kasan would provide some insight for > and we should get a kasan run through CI over the w/e. I suspect we've > feed in some garbage and called it a lock. I tracked it down to a second invocation of lock_acquire_shared_recursive() intermingled with some other regular mutexes (in this case ww_mutex). We hit this path in validate_chain(): /* * Mark recursive read, as we jump over it when * building dependencies (just like we jump over * trylock entries): */ if (ret == 2) hlock->read = 2; and that is modifying hlock_id() and so the chain-key, after it has already been computed. diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 035f81b1cc87..f193f756e1e3 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4831,7 +4831,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass, if (!validate_chain(curr, hlock, chain_head, chain_key)) return 0; - curr->curr_chain_key = chain_key; + curr->curr_chain_key = iterate_chain_key(chain_key, hlock_id(hlock)); curr->lockdep_depth++; check_chain_key(curr); works as a heavy hammer. -Chris
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
On Wed, Oct 28, 2020 at 08:42:09PM +0100, Peter Zijlstra wrote: > On Wed, Oct 28, 2020 at 05:40:48PM +, Chris Wilson wrote: > > Quoting Chris Wilson (2020-10-27 16:34:53) > > > Quoting Peter Zijlstra (2020-10-27 15:45:33) > > > > On Tue, Oct 27, 2020 at 01:29:10PM +, Chris Wilson wrote: > > > > > > > > > <4> [304.908891] hm#2, depth: 6 [6], 3425cfea6ff31f7f != > > > > > 547d92e9ec2ab9af > > > > > <4> [304.908897] WARNING: CPU: 0 PID: 5658 at > > > > > kernel/locking/lockdep.c:3679 check_chain_key+0x1a4/0x1f0 > > > > > > > > Urgh, I don't think I've _ever_ seen that warning trigger. > > > > > > > > The comments that go with it suggest memory corruption is the most > > > > likely trigger of it. Is it easy to trigger? > > > > > > For the automated CI, yes, the few machines that run that particular HW > > > test seem to hit it regularly. I have not yet reproduced it for myself. > > > I thought it looked like something kasan would provide some insight for > > > and we should get a kasan run through CI over the w/e. I suspect we've > > > feed in some garbage and called it a lock. > > > > I tracked it down to a second invocation of lock_acquire_shared_recursive() > > intermingled with some other regular mutexes (in this case ww_mutex). > > > > We hit this path in validate_chain(): > > /* > > * Mark recursive read, as we jump over it when > > * building dependencies (just like we jump over > > * trylock entries): > > */ > > if (ret == 2) > > hlock->read = 2; > > > > and that is modifying hlock_id() and so the chain-key, after it has > > already been computed. > > Ooh, interesting.. I'll have to go look at this in the morning, brain is > fried already. Thanks for digging into it. So that's commit f611e8cf98ec ("lockdep: Take read/write status in consideration when generate chainkey") that did that. So validate_chain() requires the new chain_key, but can change ->read which then invalidates the chain_key we just calculated. 
This happens when check_deadlock() returns 2, which only happens when: - next->read == 2 && ... ; however @hext is our @hlock, so that's pointless - when there's a nest_lock involved ; ww_mutex uses that !!! I suppose something like the below _might_ just do it, but I haven't compiled it, and like said, my brain is fried. Boqun, could you have a look, you're a few timezones ahead of us so your morning is earlier ;-) --- diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 3e99dfef8408..3caf63532bc2 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3556,7 +3556,7 @@ static inline int lookup_chain_cache_add(struct task_struct *curr, static int validate_chain(struct task_struct *curr, struct held_lock *hlock, - int chain_head, u64 chain_key) + int chain_head, u64 *chain_key) { /* * Trylock needs to maintain the stack of held locks, but it @@ -3568,6 +3568,7 @@ static int validate_chain(struct task_struct *curr, * (If lookup_chain_cache_add() return with 1 it acquires * graph_lock for us) */ +again: if (!hlock->trylock && hlock->check && lookup_chain_cache_add(curr, hlock, chain_key)) { /* @@ -3597,8 +3598,12 @@ static int validate_chain(struct task_struct *curr, * building dependencies (just like we jump over * trylock entries): */ - if (ret == 2) + if (ret == 2) { hlock->read = 2; + *chain_key = iterate_chain_key(hlock->prev_chain_key, hlock_id(hlock)); + goto again; + } + /* * Add dependency only if this lock is not the head * of the chain, and if it's not a secondary read-lock: @@ -3620,7 +3625,7 @@ static int validate_chain(struct task_struct *curr, #else static inline int validate_chain(struct task_struct *curr, struct held_lock *hlock, -int chain_head, u64 chain_key) +int chain_head, u64 *chain_key) { return 1; } @@ -4834,7 +4839,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass, WARN_ON_ONCE(!hlock_class(hlock)->key); } - if (!validate_chain(curr, hlock, chain_head, chain_key)) + if 
(!validate_chain(curr, hlock, chain_head, &chain_key)) return 0; curr->curr_chain_key = chain_key;
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
On Wed, Oct 28, 2020 at 05:40:48PM +, Chris Wilson wrote: > Quoting Chris Wilson (2020-10-27 16:34:53) > > Quoting Peter Zijlstra (2020-10-27 15:45:33) > > > On Tue, Oct 27, 2020 at 01:29:10PM +, Chris Wilson wrote: > > > > > > > <4> [304.908891] hm#2, depth: 6 [6], 3425cfea6ff31f7f != > > > > 547d92e9ec2ab9af > > > > <4> [304.908897] WARNING: CPU: 0 PID: 5658 at > > > > kernel/locking/lockdep.c:3679 check_chain_key+0x1a4/0x1f0 > > > > > > Urgh, I don't think I've _ever_ seen that warning trigger. > > > > > > The comments that go with it suggest memory corruption is the most > > > likely trigger of it. Is it easy to trigger? > > > > For the automated CI, yes, the few machines that run that particular HW > > test seem to hit it regularly. I have not yet reproduced it for myself. > > I thought it looked like something kasan would provide some insight for > > and we should get a kasan run through CI over the w/e. I suspect we've > > feed in some garbage and called it a lock. > > I tracked it down to a second invocation of lock_acquire_shared_recursive() > intermingled with some other regular mutexes (in this case ww_mutex). > > We hit this path in validate_chain(): > /* >* Mark recursive read, as we jump over it when >* building dependencies (just like we jump over >* trylock entries): >*/ > if (ret == 2) > hlock->read = 2; > > and that is modifying hlock_id() and so the chain-key, after it has > already been computed. Ooh, interesting.. I'll have to go look at this in the morning, brain is fried already. Thanks for digging into it.
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
Quoting Peter Zijlstra (2020-10-27 12:48:34) > On Tue, Oct 27, 2020 at 01:30:56PM +0100, Peter Zijlstra wrote: > > This seems to make it happy. Not quite sure that's the best solution. > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > > index 3e99dfef8408..81295bc760fe 100644 > > --- a/kernel/locking/lockdep.c > > +++ b/kernel/locking/lockdep.c > > @@ -4411,7 +4405,9 @@ static int mark_lock(struct task_struct *curr, struct > > held_lock *this, > > break; > > > > case LOCK_USED: > > - debug_atomic_dec(nr_unused_locks); > > + case LOCK_USED_READ: > > + if ((hlock_class(this)->usage_mask & > > (LOCKF_USED|LOCKF_USED_READ)) == new_mask) > > + debug_atomic_dec(nr_unused_locks); > > break; > > > > default: > > This also works, and I think I like it better.. anyone? > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > index 3e99dfef8408..e603e86c0227 100644 > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -4396,6 +4390,9 @@ static int mark_lock(struct task_struct *curr, struct > held_lock *this, > if (unlikely(hlock_class(this)->usage_mask & new_mask)) > goto unlock; > > + if (!hlock_class(this)->usage_mask) > + debug_atomic_dec(nr_unused_locks); > + From an outside perspective, this is much easier for me to match with the assertion in lockdep_proc. Our CI confirms this works, and we are just left with the new issue of <4> [260.903453] hm#2, depth: 6 [6], eb18a85a2df37d3d != a6ee4649c0022599 <4> [260.903458] WARNING: CPU: 7 PID: 5515 at kernel/locking/lockdep.c:3679 check_chain_key+0x1a4/0x1f0 Thanks, -Chris
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
Quoting Peter Zijlstra (2020-10-27 15:45:33) > On Tue, Oct 27, 2020 at 01:29:10PM +, Chris Wilson wrote: > > > <4> [304.908891] hm#2, depth: 6 [6], 3425cfea6ff31f7f != 547d92e9ec2ab9af > > <4> [304.908897] WARNING: CPU: 0 PID: 5658 at kernel/locking/lockdep.c:3679 > > check_chain_key+0x1a4/0x1f0 > > Urgh, I don't think I've _ever_ seen that warning trigger. > > The comments that go with it suggest memory corruption is the most > likely trigger of it. Is it easy to trigger? For the automated CI, yes, the few machines that run that particular HW test seem to hit it regularly. I have not yet reproduced it for myself. I thought it looked like something kasan would provide some insight for and we should get a kasan run through CI over the w/e. I suspect we've fed in some garbage and called it a lock. -Chris
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
On Tue, Oct 27, 2020 at 01:29:10PM +, Chris Wilson wrote: > <4> [304.908891] hm#2, depth: 6 [6], 3425cfea6ff31f7f != 547d92e9ec2ab9af > <4> [304.908897] WARNING: CPU: 0 PID: 5658 at kernel/locking/lockdep.c:3679 > check_chain_key+0x1a4/0x1f0 Urgh, I don't think I've _ever_ seen that warning trigger. The comments that go with it suggest memory corruption is the most likely trigger of it. Is it easy to trigger?
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
Quoting Peter Zijlstra (2020-10-27 12:30:56) > On Tue, Oct 27, 2020 at 12:59:55PM +0100, Peter Zijlstra wrote: > > On Tue, Oct 27, 2020 at 11:29:35AM +, Chris Wilson wrote: > > > Quoting tip-bot2 for Peter Zijlstra (2020-10-07 17:20:13) > > > > The following commit has been merged into the locking/core branch of > > > > tip: > > > > > > > > Commit-ID: 24d5a3bffef117ed90685f285c6c9d2faa3a02b4 > > > > Gitweb: > > > > https://git.kernel.org/tip/24d5a3bffef117ed90685f285c6c9d2faa3a02b4 > > > > Author:Peter Zijlstra > > > > AuthorDate:Wed, 30 Sep 2020 11:49:37 +02:00 > > > > Committer: Peter Zijlstra > > > > CommitterDate: Wed, 07 Oct 2020 18:14:17 +02:00 > > > > > > > > lockdep: Fix usage_traceoverflow > > > > > > > > Basically print_lock_class_header()'s for loop is out of sync with the > > > > the size of of ->usage_traces[]. > > > > > > We're hitting a problem, > > > > > > $ cat /proc/lockdep_stats > > > > > > upon boot generates: > > > > > > [ 29.465702] DEBUG_LOCKS_WARN_ON(debug_atomic_read(nr_unused_locks) != > > > nr_unused) > > > [ 29.465716] WARNING: CPU: 0 PID: 488 at > > > kernel/locking/lockdep_proc.c:256 lockdep_stats_show+0xa33/0xac0 > > > > > > that bisected to this patch. Only just completed the bisection and > > > thought you would like a heads up. > > > > Oh hey, that's 'curious'... it does indeed trivially reproduce, let me > > have a poke. > > This seems to make it happy. Not quite sure that's the best solution. Finished the first round of testing on this patch (will try the second in a second). 
It solves the nr_unused_locks issue, but we find something else: <4> [304.908891] hm#2, depth: 6 [6], 3425cfea6ff31f7f != 547d92e9ec2ab9af <4> [304.908897] WARNING: CPU: 0 PID: 5658 at kernel/locking/lockdep.c:3679 check_chain_key+0x1a4/0x1f0 <4> [304.908898] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio i915 mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_hda_intel btusb snd_intel_dspcfg btrtl snd_hda_codec btbcm btintel ghash_clmulni_intel snd_hwdep bluetooth snd_hda_core e1000e snd_pcm cdc_ether ptp usbnet mei_me mii pps_core mei ecdh_generic ecc intel_lpss_pci prime_numbers <4> [304.908920] CPU: 0 PID: 5658 Comm: kms_psr Not tainted 5.10.0-rc1-CI-Trybot_7174+ #1 <4> [304.908922] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3183.A00.1905020411 05/02/2019 <4> [304.908923] RIP: 0010:check_chain_key+0x1a4/0x1f0 <4> [304.908925] Code: a5 d8 08 00 00 74 e7 e8 7a eb 96 00 8b b5 e0 08 00 00 4c 89 e1 89 da 4c 8b 85 d8 08 00 00 48 c7 c7 d0 8f 30 82 e8 5f 2c 92 00 <0f> 0b 5b 5d 41 5c 41 5d c3 49 89 d5 49 c7 c4 ff ff ff ff 31 db e8 <4> [304.908926] RSP: 0018:c9ba7af0 EFLAGS: 00010086 <4> [304.908928] RAX: RBX: 0006 RCX: 0002 <4> [304.908929] RDX: 8002 RSI: 82348c47 RDI: <4> [304.908930] RBP: 88812d7dc040 R08: R09: c0012c92 <4> [304.908931] R10: 003b5380 R11: c9ba7900 R12: 3425cfea6ff31f7f <4> [304.908931] R13: 88812d7dc9f0 R14: 0003 R15: 88812d7dc9f0 <4> [304.908933] FS: 7f51722bb300() GS:88849fa0() knlGS: <4> [304.908934] CS: 0010 DS: ES: CR0: 80050033 <4> [304.908935] CR2: 7ffd197adff0 CR3: 00011d9ee004 CR4: 00770ef0 <4> [304.908935] PKRU: 5554 <4> [304.908936] Call Trace: <4> [304.908939] __lock_acquire+0x5d0/0x2740 <4> [304.908941] lock_acquire+0xdc/0x3c0 <4> [304.908944] ? drm_modeset_lock+0xf6/0x110 <4> [304.908947] __ww_mutex_lock.constprop.18+0xd0/0x1010 <4> [304.908949] ? drm_modeset_lock+0xf6/0x110 <4> [304.908951] ? 
drm_modeset_lock+0xf6/0x110 <4> [304.908953] ? ww_mutex_lock_interruptible+0x39/0xa0 <4> [304.908954] ww_mutex_lock_interruptible+0x39/0xa0 <4> [304.908956] drm_modeset_lock+0xf6/0x110 <4> [304.908958] drm_atomic_get_connector_state+0x28/0x180 <4> [304.909003] intel_psr_fastset_force+0x76/0x170 [i915] <4> [304.909034] i915_edp_psr_debug_set+0x53/0x70 [i915] <4> [304.909037] simple_attr_write+0xb1/0xd0 <4> [304.909040] full_proxy_write+0x51/0x80 <4> [304.909042] vfs_write+0xc4/0x230 <4> [304.909043] ksys_write+0x5a/0xd0 <4> [304.909045] do_syscall_64+0x33/0x80 <4> [304.909046] entry_SYSCALL_64_after_hwframe+0x44/0xa9 <4> [304.909047] RIP: 0033:0x7f517180d281 <4> [304.909049] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 59 8d 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 8a d1 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53 <4> [304.909050] RSP: 002b:7ffd197b0728 EFLAGS: 0246 ORIG_RAX: 0001 <4> [304.909051] RAX: ffda RBX: RCX: 7f517180d281 <4> [304.909052] RDX: 0003 RSI: 7f5171cb0dee RDI: 0009 <4>
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
On Tue, Oct 27, 2020 at 01:30:56PM +0100, Peter Zijlstra wrote: > This seems to make it happy. Not quite sure that's the best solution. > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > index 3e99dfef8408..81295bc760fe 100644 > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -4411,7 +4405,9 @@ static int mark_lock(struct task_struct *curr, struct > held_lock *this, > break; > > case LOCK_USED: > - debug_atomic_dec(nr_unused_locks); > + case LOCK_USED_READ: > + if ((hlock_class(this)->usage_mask & > (LOCKF_USED|LOCKF_USED_READ)) == new_mask) > + debug_atomic_dec(nr_unused_locks); > break; > > default: This also works, and I think I like it better.. anyone? diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 3e99dfef8408..e603e86c0227 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4396,6 +4390,9 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this, if (unlikely(hlock_class(this)->usage_mask & new_mask)) goto unlock; + if (!hlock_class(this)->usage_mask) + debug_atomic_dec(nr_unused_locks); + hlock_class(this)->usage_mask |= new_mask; if (new_bit < LOCK_TRACE_STATES) { @@ -4403,19 +4400,10 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this, return 0; } - switch (new_bit) { - case 0 ... LOCK_USED-1: + if (new_bit < LOCK_USED) { ret = mark_lock_irq(curr, this, new_bit); if (!ret) return 0; - break; - - case LOCK_USED: - debug_atomic_dec(nr_unused_locks); - break; - - default: - break; } unlock:
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
On Tue, Oct 27, 2020 at 12:59:55PM +0100, Peter Zijlstra wrote: > On Tue, Oct 27, 2020 at 11:29:35AM +, Chris Wilson wrote: > > Quoting tip-bot2 for Peter Zijlstra (2020-10-07 17:20:13) > > > The following commit has been merged into the locking/core branch of tip: > > > > > > Commit-ID: 24d5a3bffef117ed90685f285c6c9d2faa3a02b4 > > > Gitweb: > > > https://git.kernel.org/tip/24d5a3bffef117ed90685f285c6c9d2faa3a02b4 > > > Author:Peter Zijlstra > > > AuthorDate:Wed, 30 Sep 2020 11:49:37 +02:00 > > > Committer: Peter Zijlstra > > > CommitterDate: Wed, 07 Oct 2020 18:14:17 +02:00 > > > > > > lockdep: Fix usage_traceoverflow > > > > > > Basically print_lock_class_header()'s for loop is out of sync with the > > > the size of of ->usage_traces[]. > > > > We're hitting a problem, > > > > $ cat /proc/lockdep_stats > > > > upon boot generates: > > > > [ 29.465702] DEBUG_LOCKS_WARN_ON(debug_atomic_read(nr_unused_locks) != > > nr_unused) > > [ 29.465716] WARNING: CPU: 0 PID: 488 at > > kernel/locking/lockdep_proc.c:256 lockdep_stats_show+0xa33/0xac0 > > > > that bisected to this patch. Only just completed the bisection and > > thought you would like a heads up. > > Oh hey, that's 'curious'... it does indeed trivially reproduce, let me > have a poke. This seems to make it happy. Not quite sure that's the best solution. diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 3e99dfef8408..81295bc760fe 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4411,7 +4405,9 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this, break; case LOCK_USED: - debug_atomic_dec(nr_unused_locks); + case LOCK_USED_READ: + if ((hlock_class(this)->usage_mask & (LOCKF_USED|LOCKF_USED_READ)) == new_mask) + debug_atomic_dec(nr_unused_locks); break; default:
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
On Tue, Oct 27, 2020 at 11:29:35AM +, Chris Wilson wrote: > Quoting tip-bot2 for Peter Zijlstra (2020-10-07 17:20:13) > > The following commit has been merged into the locking/core branch of tip: > > > > Commit-ID: 24d5a3bffef117ed90685f285c6c9d2faa3a02b4 > > Gitweb: > > https://git.kernel.org/tip/24d5a3bffef117ed90685f285c6c9d2faa3a02b4 > > Author:Peter Zijlstra > > AuthorDate:Wed, 30 Sep 2020 11:49:37 +02:00 > > Committer: Peter Zijlstra > > CommitterDate: Wed, 07 Oct 2020 18:14:17 +02:00 > > > > lockdep: Fix usage_traceoverflow > > > > Basically print_lock_class_header()'s for loop is out of sync with the > > the size of of ->usage_traces[]. > > We're hitting a problem, > > $ cat /proc/lockdep_stats > > upon boot generates: > > [ 29.465702] DEBUG_LOCKS_WARN_ON(debug_atomic_read(nr_unused_locks) != > nr_unused) > [ 29.465716] WARNING: CPU: 0 PID: 488 at kernel/locking/lockdep_proc.c:256 > lockdep_stats_show+0xa33/0xac0 > > that bisected to this patch. Only just completed the bisection and > thought you would like a heads up. Oh hey, that's 'curious'... it does indeed trivially reproduce, let me have a poke.
Re: [tip: locking/core] lockdep: Fix usage_traceoverflow
Quoting tip-bot2 for Peter Zijlstra (2020-10-07 17:20:13) > The following commit has been merged into the locking/core branch of tip: > > Commit-ID: 24d5a3bffef117ed90685f285c6c9d2faa3a02b4 > Gitweb: > https://git.kernel.org/tip/24d5a3bffef117ed90685f285c6c9d2faa3a02b4 > Author:Peter Zijlstra > AuthorDate:Wed, 30 Sep 2020 11:49:37 +02:00 > Committer: Peter Zijlstra > CommitterDate: Wed, 07 Oct 2020 18:14:17 +02:00 > > lockdep: Fix usage_traceoverflow > > Basically print_lock_class_header()'s for loop is out of sync with the > the size of of ->usage_traces[]. We're hitting a problem, $ cat /proc/lockdep_stats upon boot generates: [ 29.465702] DEBUG_LOCKS_WARN_ON(debug_atomic_read(nr_unused_locks) != nr_unused) [ 29.465716] WARNING: CPU: 0 PID: 488 at kernel/locking/lockdep_proc.c:256 lockdep_stats_show+0xa33/0xac0 that bisected to this patch. Only just completed the bisection and thought you would like a heads up. -Chris
[tip: locking/core] lockdep: Fix usage_traceoverflow
The following commit has been merged into the locking/core branch of tip:

Commit-ID:     2bb8945bcc1a768f2bc402a16c9610bba8d5187d
Gitweb:        https://git.kernel.org/tip/2bb8945bcc1a768f2bc402a16c9610bba8d5187d
Author:        Peter Zijlstra
AuthorDate:    Wed, 30 Sep 2020 11:49:37 +02:00
Committer:     Ingo Molnar
CommitterDate: Fri, 09 Oct 2020 08:53:08 +02:00

lockdep: Fix usage_traceoverflow

Basically print_lock_class_header()'s for loop is out of sync with the
size of ->usage_traces[].

Also clean things up a bit while at it, to avoid such mishaps in the future.

Fixes: 23870f122768 ("locking/lockdep: Fix "USED" <- "IN-NMI" inversions")
Reported-by: Qian Cai
Debugged-by: Boqun Feng
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Tested-by: Qian Cai
Link: https://lkml.kernel.org/r/20200930094937.ge2...@hirez.programming.kicks-ass.net
---
 include/linux/lockdep_types.h      |  8 +--
 kernel/locking/lockdep.c           | 32 +
 kernel/locking/lockdep_internals.h |  7 --
 3 files changed, 26 insertions(+), 21 deletions(-)

diff --git a/include/linux/lockdep_types.h b/include/linux/lockdep_types.h
index bb35b44..9a1fd49 100644
--- a/include/linux/lockdep_types.h
+++ b/include/linux/lockdep_types.h
@@ -35,8 +35,12 @@ enum lockdep_wait_type {
 /*
  * We'd rather not expose kernel/lockdep_states.h this wide, but we do need
  * the total number of states... :-(
+ *
+ * XXX_LOCK_USAGE_STATES is the number of lines in lockdep_states.h, for each
+ * of those we generates 4 states, Additionally we report on USED and USED_READ.
  */
-#define XXX_LOCK_USAGE_STATES	(1+2*4)
+#define XXX_LOCK_USAGE_STATES	2
+#define LOCK_TRACE_STATES	(XXX_LOCK_USAGE_STATES*4 + 2)

 /*
  * NR_LOCKDEP_CACHING_CLASSES ... Number of classes
@@ -106,7 +110,7 @@ struct lock_class {
 	 * IRQ/softirq usage tracking bits:
 	 */
 	unsigned long			usage_mask;
-	const struct lock_trace		*usage_traces[XXX_LOCK_USAGE_STATES];
+	const struct lock_trace		*usage_traces[LOCK_TRACE_STATES];

 	/*
 	 * Generation counter, when doing certain classes of graph walking,
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 2facbbd..a430fbb 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -585,6 +585,8 @@ static const char *usage_str[] =
 #include "lockdep_states.h"
 #undef LOCKDEP_STATE
 	[LOCK_USED] = "INITIAL USE",
+	[LOCK_USED_READ] = "INITIAL READ USE",
+	/* abused as string storage for verify_lock_unused() */
 	[LOCK_USAGE_STATES] = "IN-NMI",
 };
 #endif
@@ -1939,7 +1941,7 @@ static void print_lock_class_header(struct lock_class *class, int depth)
 #endif
 	printk(KERN_CONT " {\n");

-	for (bit = 0; bit < LOCK_USAGE_STATES; bit++) {
+	for (bit = 0; bit < LOCK_TRACE_STATES; bit++) {
 		if (class->usage_mask & (1 << bit)) {
 			int len = depth;

@@ -3969,7 +3971,7 @@ static int separate_irq_context(struct task_struct *curr,
 static int mark_lock(struct task_struct *curr, struct held_lock *this,
 			     enum lock_usage_bit new_bit)
 {
-	unsigned int old_mask, new_mask, ret = 1;
+	unsigned int new_mask, ret = 1;

 	if (new_bit >= LOCK_USAGE_STATES) {
 		DEBUG_LOCKS_WARN_ON(1);
@@ -3996,30 +3998,26 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
 	if (unlikely(hlock_class(this)->usage_mask & new_mask))
 		goto unlock;

-	old_mask = hlock_class(this)->usage_mask;
 	hlock_class(this)->usage_mask |= new_mask;

-	/*
-	 * Save one usage_traces[] entry and map both LOCK_USED and
-	 * LOCK_USED_READ onto the same entry.
-	 */
-	if (new_bit == LOCK_USED || new_bit == LOCK_USED_READ) {
-		if (old_mask & (LOCKF_USED | LOCKF_USED_READ))
-			goto unlock;
-		new_bit = LOCK_USED;
+	if (new_bit < LOCK_TRACE_STATES) {
+		if (!(hlock_class(this)->usage_traces[new_bit] = save_trace()))
+			return 0;
 	}

-	if (!(hlock_class(this)->usage_traces[new_bit] = save_trace()))
-		return 0;
-
 	switch (new_bit) {
+	case 0 ... LOCK_USED-1:
+		ret = mark_lock_irq(curr, this, new_bit);
+		if (!ret)
+			return 0;
+		break;
+
 	case LOCK_USED:
 		debug_atomic_dec(nr_unused_locks);
 		break;
+
 	default:
-		ret = mark_lock_irq(curr, this, new_bit);
-		if (!ret)
-			return 0;
+		break;
 	}

 unlock:
diff --git a/kernel/locking/lockdep_internals.h b/kernel/locking/lockdep_internals.h
index b0be156..de49f9e 100644
--- a/kernel/locking/lockdep_internals.h
+++
[tip: locking/core] lockdep: Fix usage_traceoverflow
The following commit has been merged into the locking/core branch of tip:

Commit-ID:     24d5a3bffef117ed90685f285c6c9d2faa3a02b4
Gitweb:        https://git.kernel.org/tip/24d5a3bffef117ed90685f285c6c9d2faa3a02b4
Author:        Peter Zijlstra
AuthorDate:    Wed, 30 Sep 2020 11:49:37 +02:00
Committer:     Peter Zijlstra
CommitterDate: Wed, 07 Oct 2020 18:14:17 +02:00

lockdep: Fix usage_traceoverflow

Basically print_lock_class_header()'s for loop is out of sync with the
size of ->usage_traces[].

Clean things up a bit.

Fixes: 23870f122768 ("locking/lockdep: Fix "USED" <- "IN-NMI" inversions")
Reported-by: Qian Cai
Debugged-by: Boqun Feng
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Qian Cai
Link: https://lkml.kernel.org/r/20200930094937.ge2...@hirez.programming.kicks-ass.net
---
 include/linux/lockdep_types.h      |  8 +--
 kernel/locking/lockdep.c           | 32 +
 kernel/locking/lockdep_internals.h |  7 --
 3 files changed, 26 insertions(+), 21 deletions(-)

diff --git a/include/linux/lockdep_types.h b/include/linux/lockdep_types.h
index bb35b44..9a1fd49 100644
--- a/include/linux/lockdep_types.h
+++ b/include/linux/lockdep_types.h
@@ -35,8 +35,12 @@ enum lockdep_wait_type {
 /*
  * We'd rather not expose kernel/lockdep_states.h this wide, but we do need
  * the total number of states... :-(
+ *
+ * XXX_LOCK_USAGE_STATES is the number of lines in lockdep_states.h, for each
+ * of those we generates 4 states, Additionally we report on USED and USED_READ.
  */
-#define XXX_LOCK_USAGE_STATES	(1+2*4)
+#define XXX_LOCK_USAGE_STATES	2
+#define LOCK_TRACE_STATES	(XXX_LOCK_USAGE_STATES*4 + 2)

 /*
  * NR_LOCKDEP_CACHING_CLASSES ... Number of classes
@@ -106,7 +110,7 @@ struct lock_class {
 	 * IRQ/softirq usage tracking bits:
 	 */
 	unsigned long			usage_mask;
-	const struct lock_trace		*usage_traces[XXX_LOCK_USAGE_STATES];
+	const struct lock_trace		*usage_traces[LOCK_TRACE_STATES];

 	/*
 	 * Generation counter, when doing certain classes of graph walking,
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index cdb3892..b3b3161 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -600,6 +600,8 @@ static const char *usage_str[] =
 #include "lockdep_states.h"
 #undef LOCKDEP_STATE
 	[LOCK_USED] = "INITIAL USE",
+	[LOCK_USED_READ] = "INITIAL READ USE",
+	/* abused as string storage for verify_lock_unused() */
 	[LOCK_USAGE_STATES] = "IN-NMI",
 };
 #endif
@@ -2252,7 +2254,7 @@ static void print_lock_class_header(struct lock_class *class, int depth)
 #endif
 	printk(KERN_CONT " {\n");

-	for (bit = 0; bit < LOCK_USAGE_STATES; bit++) {
+	for (bit = 0; bit < LOCK_TRACE_STATES; bit++) {
 		if (class->usage_mask & (1 << bit)) {
 			int len = depth;

@@ -4345,7 +4347,7 @@ static int separate_irq_context(struct task_struct *curr,
 static int mark_lock(struct task_struct *curr, struct held_lock *this,
 			     enum lock_usage_bit new_bit)
 {
-	unsigned int old_mask, new_mask, ret = 1;
+	unsigned int new_mask, ret = 1;

 	if (new_bit >= LOCK_USAGE_STATES) {
 		DEBUG_LOCKS_WARN_ON(1);
@@ -4372,30 +4374,26 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
 	if (unlikely(hlock_class(this)->usage_mask & new_mask))
 		goto unlock;

-	old_mask = hlock_class(this)->usage_mask;
 	hlock_class(this)->usage_mask |= new_mask;

-	/*
-	 * Save one usage_traces[] entry and map both LOCK_USED and
-	 * LOCK_USED_READ onto the same entry.
-	 */
-	if (new_bit == LOCK_USED || new_bit == LOCK_USED_READ) {
-		if (old_mask & (LOCKF_USED | LOCKF_USED_READ))
-			goto unlock;
-		new_bit = LOCK_USED;
+	if (new_bit < LOCK_TRACE_STATES) {
+		if (!(hlock_class(this)->usage_traces[new_bit] = save_trace()))
+			return 0;
 	}

-	if (!(hlock_class(this)->usage_traces[new_bit] = save_trace()))
-		return 0;
-
 	switch (new_bit) {
+	case 0 ... LOCK_USED-1:
+		ret = mark_lock_irq(curr, this, new_bit);
+		if (!ret)
+			return 0;
+		break;
+
 	case LOCK_USED:
 		debug_atomic_dec(nr_unused_locks);
 		break;
+
 	default:
-		ret = mark_lock_irq(curr, this, new_bit);
-		if (!ret)
-			return 0;
+		break;
 	}

 unlock:
diff --git a/kernel/locking/lockdep_internals.h b/kernel/locking/lockdep_internals.h
index b0be156..de49f9e 100644
--- a/kernel/locking/lockdep_internals.h
+++ b/kernel/locking/lockdep_internals.h
@@ -20,9 +20,12 @@ enum lock_usage_bit {
 #undef LOCKDEP_STATE