On Mon, Aug 05, 2024 at 01:14:12PM -0700, Doug Anderson wrote: > On Mon, Aug 5, 2024 at 10:24 AM Will Deacon <[email protected]> wrote: > > On Mon, Aug 05, 2024 at 10:13:11AM -0700, Doug Anderson wrote: > > > On Mon, Aug 5, 2024 at 9:53 AM Will Deacon <[email protected]> wrote: > > > > On Tue, Jun 25, 2024 at 04:07:22PM -0700, Douglas Anderson wrote: > > > > > @@ -1084,79 +1088,87 @@ static inline unsigned int > > > > > num_other_online_cpus(void) > > > > > > > > > > void smp_send_stop(void) > > > > > { > > > > > + static unsigned long stop_in_progress; > > > > > + cpumask_t mask; > > > > > unsigned long timeout; > > > > > > > > > > - if (num_other_online_cpus()) { > > > > > - cpumask_t mask; > > > > > + /* > > > > > + * If this cpu is the only one alive at this point in time, > > > > > online or > > > > > + * not, there are no stop messages to be sent around, so just > > > > > back out. > > > > > + */ > > > > > + if (num_other_online_cpus() == 0) > > > > > + goto skip_ipi; > > > > > > > > > > - cpumask_copy(&mask, cpu_online_mask); > > > > > - cpumask_clear_cpu(smp_processor_id(), &mask); > > > > > + /* Only proceed if this is the first CPU to reach this code */ > > > > > + if (test_and_set_bit(0, &stop_in_progress)) > > > > > + return; > > > > > > > > > > - if (system_state <= SYSTEM_RUNNING) > > > > > - pr_crit("SMP: stopping secondary CPUs\n"); > > > > > - smp_cross_call(&mask, IPI_CPU_STOP); > > > > > - } > > > > > + cpumask_copy(&mask, cpu_online_mask); > > > > > + cpumask_clear_cpu(smp_processor_id(), &mask); > > > > > > > > > > - /* Wait up to one second for other CPUs to stop */ > > > > > + if (system_state <= SYSTEM_RUNNING) > > > > > + pr_crit("SMP: stopping secondary CPUs\n"); > > > > > + > > > > > + /* > > > > > + * Start with a normal IPI and wait up to one second for other > > > > > CPUs to > > > > > + * stop. We do this first because it gives other processors a > > > > > chance > > > > > + * to exit critical sections / drop locks and makes the rest of > > > > > the > > > > > + * stop process (especially console flush) more robust. > > > > > + */ > > > > > + smp_cross_call(&mask, IPI_CPU_STOP); > > > > > > > > I realise you've moved this out of crash_smp_send_stop() and it looks > > > > like we inherited the code from x86, but do you know how this serialise > > > > against CPU hotplug operations? I've spent the last 20m looking at the > > > > code and I can't see what prevents other CPUs from coming and going > > > > while we're trying to IPI a non-atomic copy of 'cpu_online_mask'. > > > > > > I don't think there is anything. ...and it's not just this code > > > either. It sure looks like nmi_trigger_cpumask_backtrace() has the > > > same problem. > > > > > > I guess maybe in the case of nmi_trigger_cpumask_backtrace() it's not > > > such a big deal because: > > > 1. If a CPU goes away then we'll just time out > > > 2. If a CPU shows up then we'll skip backtracing it, but we were > > > sending backtraces at an instant in time anyway. > > > > > > In the case of smp_send_stop() it's probably fine if a CPU goes away > > > because, again, we'll just timeout. ...but if a CPU shows up then > > > that's not super ideal. Maybe it doesn't cause problems in practice > > > but it does feel like it should be fixed. > > > > On the other hand, I really don't fancy taking a CPU hotplug mutex on > > the panic() path, so it's hard to know what's best. > > > > I suppose one thing we could do is recompute the mask before sending the > > NMI, which should make it a little more robust in that case. It still > > feels fragile, but it's no worse than the existing code, I suppose. > > Happy to do this and send a new version if you want. Just let me know. > Basically it's just replacing `cpumask_and(&mask, &mask, > cpu_online_mask)` with `cpumask_copy(&mask, cpu_online_mask); > cpumask_clear_cpu(smp_processor_id(), &mask);` in the case where we > fallback to NMI. If you're happy with the patch I'm also happy if you > make this change while applying.
Please send a v3 with that along with a little comment summarising this discussion so I don't confuse myself again when I look at next time :) Thanks, Will
