Hi everyone, Michael Ellerman <m...@ellerman.id.au> writes: > Greg Kurz <gr...@kaod.org> writes: >> On Tue, 04 Aug 2020 23:35:10 +1000 >> Michael Ellerman <m...@ellerman.id.au> wrote: >>> Spinning forever seems like a bad idea, but as has been demonstrated at >>> least twice now, continuing when we don't know the state of the other >>> CPU can lead to straight up crashes. >>> >>> So I think I'm persuaded that it's preferable to have the kernel stuck >>> spinning rather than oopsing. >>> >> >> +1 >> >>> I'm 50/50 on whether we should have a cond_resched() in the loop. My >>> first instinct is no, if we're stuck here for 20s a stack trace would be >>> good. But then we will probably hit that on some big and/or heavily >>> loaded machine. >>> >>> So possibly we should call cond_resched() but have some custom logic in >>> the loop to print a warning if we are stuck for more than some >>> sufficiently long amount of time. >> >> How long should that be ? > > Yeah good question. > > I guess step one would be seeing how long it can take on the 384 vcpu > machine. And we can probably test on some other big machines. > > Hopefully Nathan can give us some idea of how long he's seen it take on > large systems? I know he was concerned about the 20s timeout of the > softlockup detector.
Maybe I'm not quite clear what this is referring to, but I don't think stop-self/query-stopped-state latency increases with processor count, at least not on PowerVM. And IIRC I was concerned with the earlier patch's potential for causing the softlockup watchdog to rightly complain by polling the stopped state without ever scheduling away. The fact that smp_query_cpu_stopped() kind of collapses the two distinct results from the query-cpu-stopped-state RTAS call into one return value may make it harder than necessary to reason about the questions around cond_resched() and whether to warn. Sorry to pull this stunt but I have had some code sitting in a neglected branch that I think gets the logic around this right. What we should have is a simple C wrapper for the RTAS call that reflects the architected inputs and outputs: ================================================================ (-- rtas.c --) /** * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state. * @hwcpu: Identifies the processor thread to be queried. * @status: Pointer to status, valid only on success. * * Determine whether the given processor thread is in the stopped * state. If successful and @status is non-NULL, the thread's status * is stored to @status. * * Return: * * 0 - Success * * -1 - Hardware error * * -2 - Busy, try again later */ int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status) { unsigned int cpu_status; int token; int fwrc; token = rtas_token("query-cpu-stopped-state"); fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu); if (fwrc != 0) goto out; if (status != NULL) *status = cpu_status; out: return fwrc; } ================================================================ And then a utility function that waits for the remote thread to enter stopped state, with higher-level logic for rescheduling and warning. The fact that smp_query_cpu_stopped() currently does not handle a -2/busy status is a bug, fixed below by using rtas_busy_delay(). Note the justification for the explicit cond_resched() in the outer loop: ================================================================ (-- rtas.h --) /* query-cpu-stopped-state CPU_status */ #define RTAS_QCSS_STATUS_STOPPED 0 #define RTAS_QCSS_STATUS_IN_PROGRESS 1 #define RTAS_QCSS_STATUS_NOT_STOPPED 2 (-- pseries/hotplug-cpu.c --) /** * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state. */ static void wait_for_cpu_stopped(unsigned int cpu) { unsigned int status; unsigned int hwcpu; hwcpu = get_hard_smp_processor_id(cpu); do { int fwrc; /* * rtas_busy_delay() will yield only if RTAS returns a * busy status. Since query-cpu-stopped-state can * yield RTAS_QCSS_STATUS_IN_PROGRESS or * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded * period before the target thread stops, we must take * care to explicitly reschedule while polling. */ cond_resched(); do { fwrc = rtas_query_cpu_stopped_state(hwcpu, &status); } while (rtas_busy_delay(fwrc)); if (fwrc == 0) continue; pr_err_ratelimited("query-cpu-stopped-state for " "thread 0x%x returned %d\n", hwcpu, fwrc); goto out; } while (status == RTAS_QCSS_STATUS_NOT_STOPPED || status == RTAS_QCSS_STATUS_IN_PROGRESS); if (status != RTAS_QCSS_STATUS_STOPPED) { pr_err_ratelimited("query-cpu-stopped-state yielded unknown " "status %d for thread 0x%x\n", status, hwcpu); } out: return; } [...] static void pseries_cpu_die(unsigned int cpu) { wait_for_cpu_stopped(cpu); paca_ptrs[cpu]->cpu_start = 0; } ================================================================ wait_for_cpu_stopped() should be able to accommodate a time-based warning if necessary, but speaking as a likely recipient of any bug reports that would arise here, I'm not convinced of the need and I don't know what a good value would be. It's relatively easy to sample the stack of a task that's apparently failing to make progress, plus I probably would use 'perf probe' or similar to report the inputs and outputs for the RTAS call. I'm happy to make this a proper submission after I can clean it up and retest it, or Michael R. is welcome to appropriate it, assuming it's acceptable.