On Wed, 2014-01-29 at 12:51 +0100, Peter Zijlstra wrote:
> On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
> > > But urgh, nasty problem. Lemme ponder this a bit.
>
> OK, please have a very careful look at the below. It survived a boot
> with udev -- which usually stresses mutex contention enough to explode
> (in fact it did a few time when I got the contention/cancel path wrong),
> however I have not ran anything else on it.

I tested this patch on a 2 socket, 8 core machine with the AIM7 fserver
workload. After 100 users, the system gets soft lockups.

Some condition may be causing threads to never leave the "goto unqueue"
loop. I added a debug counter, and threads were able to reach more than
1,000,000,000 "goto unqueue" iterations.

I was also initially wondering whether there can be problems when
multiple threads hit need_resched() and unqueue at the same time. As an
example, 2 nodes that need to reschedule are next to each other in the
middle of the MCS queue. The 1st node executes
"while (!(next = ACCESS_ONCE(node->next)))" and exits the while loop
because next is not NULL. Then, the 2nd node executes its
"if (cmpxchg(&prev->next, node, NULL) != node)", which succeeds because
the 1st node's ->next still points to it. We may then end up in a
situation where the node before the 1st node gets linked with the
outdated 2nd node.
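
The interleaving described above can be replayed deterministically as a single-threaded sketch. This is not the code from the patch: `struct node`, `cas_ptr()`, `replay_race()` and the exact ordering of the steps are assumptions made for illustration, and only the two quoted statements are modelled. The queue is assumed to be a -> b -> c -> d, with b as the 1st node and c as the 2nd node.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue node; only the fields named in the discussion. */
struct node {
	struct node *prev;
	struct node *next;
};

/* Single-threaded stand-in for cmpxchg(): returns the old value. */
static struct node *cas_ptr(struct node **p, struct node *old,
			    struct node *new_val)
{
	struct node *cur = *p;

	if (cur == old)
		*p = new_val;
	return cur;
}

/*
 * Replay of the suspected race. Queue is a -> b -> c -> d and both
 * b and c need to reschedule. Returns what a->next ends up pointing at.
 */
static struct node *replay_race(struct node *a, struct node *b,
				struct node *c, struct node *d)
{
	struct node *b_next, *old;

	a->prev = NULL; a->next = b;
	b->prev = a;    b->next = c;
	c->prev = b;    c->next = d;
	d->prev = c;    d->next = NULL;

	/* Assumed earlier step: b has already detached itself from a. */
	old = cas_ptr(&a->next, b, NULL);
	assert(old == b);

	/* b: "while (!(next = ACCESS_ONCE(node->next)))" exits, next == c. */
	b_next = b->next;

	/* c: "cmpxchg(&prev->next, node, NULL)" with prev == b succeeds. */
	old = cas_ptr(&b->next, c, NULL);
	assert(old == c);

	/* b: links its predecessor with the value it read before c left. */
	b_next->prev = a;
	a->next = b_next;

	return a->next;
}
```

Calling `replay_race(&a, &b, &c, &d)` on four stack-allocated nodes leaves `a.next` pointing at `c`, even though `c` has already started to unqueue and still intends to retry against its old view of the queue.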