Thomas Gleixner <[email protected]> writes: > On Fri, Feb 19 2021 at 12:31, Vitaly Kuznetsov wrote: > >> When irq_matrix_assign()/irq_matrix_free() calls get unsynced, weird >> effects are possible, e.g. when cm->allocated goes negative CPU hotplug >> may get blocked. Add WARN_ON_ONCE() to simplify detecting such situations. >> >> Signed-off-by: Vitaly Kuznetsov <[email protected]> >> --- >> kernel/irq/matrix.c | 11 +++++++++-- >> 1 file changed, 9 insertions(+), 2 deletions(-) >> >> diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c >> index 651a4ad6d711..2438a4f9d726 100644 >> --- a/kernel/irq/matrix.c >> +++ b/kernel/irq/matrix.c >> @@ -189,7 +189,9 @@ void irq_matrix_assign_system(struct irq_matrix *m, >> unsigned int bit, >> set_bit(bit, m->system_map); >> if (replace) { >> BUG_ON(!test_and_clear_bit(bit, cm->alloc_map)); >> + WARN_ON_ONCE(!cm->allocated); >> cm->allocated--; >> + WARN_ON_ONCE(!m->total_allocated); > > This hunk is not really useful. It already dies when the bit is not set > in the alloc map.
This was to check for the hypothetical issue when then number of bits set get out of sync with 'total_allocated' counter -- which is likely impossible today but could maybe be useful as a future proof. In case this seems to be too much I'm not against dropping it. > >> m->total_allocated--; >> } >> if (bit >= m->alloc_start && bit < m->alloc_end) >> @@ -424,12 +426,17 @@ void irq_matrix_free(struct irq_matrix *m, unsigned >> int cpu, >> return; >> >> clear_bit(bit, cm->alloc_map); >> + WARN_ON_ONCE(!cm->allocated); >> cm->allocated--; > > WARN and then decrement is not necessarily any better than just > decrementing unconditionally. It's just more noisy. > > Why would you let the counter wrap into negative space if you already > know it's 0? > > There is a way more useful way to handle this. In such a case the bit is > NOT set in the alloc map. So: > > if (!WARN_ON_ONCE(test_and_clear_bit(bit, cm->alloc_map))) > return; > > would have caught the problem at hand nicely and let the machine survive > while just throwing warns and continuing is broken to begin with. Thanks, I like the idea. I didn't do that probably because the problem which triggered me to write these patches wasn't fatal, it was just causing CPU0 offlining to fail. > > Thanks, > > tglx > -- Vitaly

