[akaros] [PATCH] Fix a deadlock bug in MCS-PDR locks

Barret Rhoden Fri, 04 Dec 2015 13:36:42 -0800

The MCS-PDR locks have an optimization built in where most vcores ensure the
lockholder runs.  The other style is to ensure whoever is directly ahead in
line (the predecessor) runs, which we fall back to in some tricky corner
cases.  That all is fine.

In the unlock case, the lockholder needs to ensure whoever is next in line
runs (the lockholder's successor).  Waiting on a_tail requires that
everyone in line is ensuring their predecessor runs, which is not what
happens by default.  By clearing the lockholder_vcoreid, we can fall back
to this "get out of corner cases by following the chain" approach.
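
To illustrate the two policies, here is a minimal sketch of how a spinning
waiter picks the vcore it ensures is running.  It is not the parlib code:
sketch_pick_target() and SKETCH_NO_LOCKHOLDER are hypothetical names, and the
"otherwise ensure the lockholder" branch is inferred from the description
above rather than shown in this patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for MCSPDR_NO_LOCKHOLDER; the real value is not shown
 * in this patch. */
#define SKETCH_NO_LOCKHOLDER ((uint32_t)-1)

/* Returns the vcore a spinning waiter should ensure is running.
 *
 * Assumption: when a specific lockholder is advertised and it is not us, we
 * short-circuit and ensure the lockholder.  Otherwise we fall back to walking
 * the chain by ensuring our predecessor. */
static uint32_t sketch_pick_target(uint32_t lockholder, uint32_t self,
                                   uint32_t pred)
{
	if (lockholder == SKETCH_NO_LOCKHOLDER || lockholder == self)
		return pred;		/* walk the chain */
	return lockholder;		/* short-circuit to the lockholder */
}
```

Clearing lockholder_vcoreid forces every waiter down the first branch, which
is what the unlock path relies on.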

I've been chasing this one off and on for a couple years.  I managed to
recreate it once or twice and was able to peek at the userspace contexts.
I could see all but one of them were calling ensure_vcore_runs(2), and VC 2
was calling ensure_vcore_runs(1).  Two other vcores were preempted.  I
could see VC 2 was in an unlock.  There were no syscalls coming from that
process.

It was actually simple after that.  What happened was that a vcore signed
up for the lock (L391), but hadn't set pred->next yet (L396).  Then it got
preempted.  VC 2 was its pred, and it acquired the lock with no issues.
When VC 2 went to unlock, it needed to ensure its successor was running.
But VC 2 was still the lockholder_vcoreid, so everyone in the chain behind
the preempted VC kept ensuring 2 ran.  No one ensured the preempted VC ran.
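
The scenario can be replayed in a toy model.  This is only a sketch under
simplifying assumptions, not the parlib implementation: the queue order, the
single preempted vcore, and the names sketch_successor_runs() and
SKETCH_NO_LOCKHOLDER are all made up for illustration, and "ensuring" a vcore
is modeled as just marking it running.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NO_LOCKHOLDER ((uint32_t)-1)	/* hypothetical stand-in */
#define SKETCH_NR_VCORES 4

/* Toy model: queue[0] is the unlocking lockholder (VC 2), queue[1] is its
 * successor (VC 3, preempted before it could set pred->next), and the rest
 * wait behind it, with VC 1 as the tail.  'lockholder' is what the waiters
 * read from lockholder_vcoreid.  Returns true if the successor ends up
 * running. */
static bool sketch_successor_runs(uint32_t lockholder)
{
	const uint32_t queue[SKETCH_NR_VCORES] = {2, 3, 0, 1};
	/* Indexed by vcoreid: only VC 3 starts out preempted. */
	bool running[SKETCH_NR_VCORES] = {true, true, true, false};

	/* The unlocker ensures a tail runs (VC 1 here, already running). */
	running[queue[SKETCH_NR_VCORES - 1]] = true;

	/* Let the waiters react for a few rounds, from the tail upward. */
	for (int round = 0; round < SKETCH_NR_VCORES; round++) {
		for (int i = SKETCH_NR_VCORES - 1; i >= 1; i--) {
			uint32_t self = queue[i], pred = queue[i - 1];

			/* Preempted waiters do nothing; running ones only
			 * act when their pred is preempted. */
			if (!running[self] || running[pred])
				continue;
			if (lockholder == SKETCH_NO_LOCKHOLDER ||
			    lockholder == self)
				running[pred] = true;	    /* walk the chain */
			else
				running[lockholder] = true; /* short-circuit */
		}
	}
	return running[queue[1]];
}
```

With lockholder set to 2, the waiter behind the preempted vcore keeps ensuring
VC 2 and the successor never runs.  With SKETCH_NO_LOCKHOLDER, that waiter
ensures its pred, and the successor gets to run and finish linking itself in.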

Signed-off-by: Barret Rhoden <[email protected]>
---
 user/parlib/mcs.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/user/parlib/mcs.c b/user/parlib/mcs.c
index 7f03999dae0a..9c5a93459108 100644
--- a/user/parlib/mcs.c
+++ b/user/parlib/mcs.c
@@ -413,6 +413,7 @@ void __mcs_pdr_lock(struct mcs_pdr_lock *lock, struct mcs_pdr_qnode *qnode)
                         * so they can hand us the lock. */
                        if (vcore_is_preempted(pred_vcoreid) ||
                            seq != __procinfo.coremap_seqctr) {
+                               /* Note that we don't normally ensure our *pred* runs. */
                                if (lock->lockholder_vcoreid == MCSPDR_NO_LOCKHOLDER ||
                                    lock->lockholder_vcoreid == vcore_id())
                                        ensure_vcore_runs(pred_vcoreid);
@@ -451,7 +452,13 @@ void __mcs_pdr_unlock(struct mcs_pdr_lock *lock, struct mcs_pdr_qnode *qnode)
                while (qnode->next == 0) {
                        /* We need to get our next to run, but we don't know who they are.
                         * If we make sure a tail is running, that will percolate up to make
-                        * sure our qnode->next is running */
+                        * sure our qnode->next is running.
+                        *
+                        * But first, we need to tell everyone that there is no specific
+                        * lockholder.  lockholder_vcoreid is a short-circuit on the "walk
+                        * the chain" PDR.  Normally, that's okay.  But now we need to make
+                        * sure everyone is walking the chain from a_tail up to our pred. */
+                       lock->lockholder_vcoreid = MCSPDR_NO_LOCKHOLDER;
                        ensure_vcore_runs(a_tail_vcoreid);
                        cpu_relax();
                }
-- 
2.6.0.rc2.230.g3dd15c0
