On 30/08/16 22:19, Peter Zijlstra wrote: > On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote: >> >> >> The origin of the issue I've seen seems to be related to >> rwsem spin lock stealing. Basically I see the system deadlock'd in the >> following state > > As Nick says (good to see you're back Nick!), this is unrelated to > rwsems. > > This is true for pretty much every blocking wait loop out there, they > all do: > > for (;;) { > current->state = UNINTERRUPTIBLE; > smp_mb(); > if (cond) > break; > schedule(); > } > current->state = RUNNING; > > Which, if the wakeup is spurious, is just the pattern you need.
Yes True! My bad Alexey had seen the same basic pattern, I should have been clearer in my commit log. Should I resend the patch? > >> +++ b/kernel/sched/core.c >> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int >> state, int wake_flags) >> success = 1; /* we're going to change ->state */ >> cpu = task_cpu(p); >> >> + /* >> + * Ensure we see on_rq and p_state consistently >> + * >> + * For example in __rwsem_down_write_failed(), we have >> + * [S] ->on_rq = 1 [L] ->state >> + * MB RMB > > There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks > to PPC, is _not_ MB. It is however sufficient for this case. > The MB comes from the __switch_to() in schedule(). Ben mentioned it in a different thread. >> + * [S] ->state = TASK_UNINTERRUPTIBLE [L] ->on_rq >> + * In the absence of the RMB p->on_rq can be observed to be 0 >> + * and we end up spinning indefinitely in while (p->on_cpu) >> + */ > > > /* > * Ensure we load p->on_rq _after_ p->state, otherwise it would > * be possible to, falsely, observe p->on_rq == 0 and get stuck > * in smp_cond_load_acquire() below. > * > * sched_ttwu_pending() try_to_wake_up() > * [S] p->on_rq = 1; [L] P->state > * UNLOCK rq->lock > * > * schedule() RMB > * LOCK rq->lock > * UNLOCK rq->lock > * > * [task p] > * [S] p->state = UNINTERRUPTIBLE [L] p->on_rq > * > * Pairs with the UNLOCK+LOCK on rq->lock from the > * last wakeup of our task and the schedule that got our task > * current. > */ > >> + smp_rmb(); >> if (p->on_rq && ttwu_remote(p, wake_flags)) >> goto stat; >> > > > Now, this has been present for a fair while, I suspect ever since we > reworked the wakeup path to not use rq->lock twice. Curious you only now > hit it. > Yes, I just hit it a a week or two back and I needed to collect data to explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me I needed a system with large threads and less memory running stress-ng. Reproducing the problem takes an unpredictable amount of time. Balbir Singh.