On 23/01/23(Mon) 22:57, David Hill wrote:
> On 1/20/23 09:02, Martin Pieuchot wrote:
> > > [...] 
> > > Ran it 20 times and all completed and passed.  I was also able to 
> > > interrupt
> > > it as well.   no issues.
> > > 
> > > Excellent!
> > 
> > Here's the best fix I could come up with.  We mark the VM map as "busy"
> > during the page fault just before the faulting thread releases the shared
> > lock.  This ensures no other thread will grab an exclusive lock until the
> > fault is finished.
> > 
> > I couldn't trigger the reproducer with this, can you?
> 
> Yes, same result as before.  This patch does not seem to help.

Is it the same as before?  I doubt it is.  On a 4-CPU machine I can't
trigger the race described in this thread.  On a 8-CPU one I now see all
threads sleeping on "thrsleep" except one in "kqread" and one in "wait".

I don't know what's happening.  I don't know how to debug Go code.  I
can't say if this is a race in UVM or something else.  The original race
reported in this thread seems fixed by the diff I sent.  I don't know if
there's another one or if I missed something.  I can't figure it out by
myself, I'd appreciated if somebody else with some knowledge in Go could
give me a hand.

Joel could you find some time to reproduce the hang with the diff below
applied?

To reproduce, just do:
$ doas pkg_add git go
$ git clone https://github.com/etcd-io/bbolt.git
$ cd bbolt
$ git checkout v1.3.6
$ go test -v -run TestSimulate_10000op_10p

Index: uvm/uvm_fault.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.133
diff -u -p -r1.133 uvm_fault.c
--- uvm/uvm_fault.c     4 Nov 2022 09:36:44 -0000       1.133
+++ uvm/uvm_fault.c     20 Jan 2023 13:52:58 -0000
@@ -1277,6 +1277,9 @@ uvm_fault_lower(struct uvm_faultinfo *uf
                /* update rusage counters */
                curproc->p_ru.ru_majflt++;
 
+               /* prevent siblings to grab an exclusive lock on the map */
+               vm_map_busy(ufi->map);
+
                uvmfault_unlockall(ufi, amap, NULL);
 
                counters_inc(uvmexp_counters, flt_get);
@@ -1293,13 +1296,16 @@ uvm_fault_lower(struct uvm_faultinfo *uf
                        KASSERT(result != VM_PAGER_PEND);
 
                        if (result == VM_PAGER_AGAIN) {
+                               vm_map_unbusy(ufi->map);
                                tsleep_nsec(&nowake, PVM, "fltagain2",
                                    MSEC_TO_NSEC(5));
                                return ERESTART;
                        }
 
-                       if (!UVM_ET_ISNOFAULT(ufi->entry))
+                       if (!UVM_ET_ISNOFAULT(ufi->entry)) {
+                               vm_map_unbusy(ufi->map);
                                return (EIO);
+                       }
 
                        uobjpage = PGO_DONTCARE;
                        uobj = NULL;
@@ -1308,6 +1314,7 @@ uvm_fault_lower(struct uvm_faultinfo *uf
 
                /* re-verify the state of the world.  */
                locked = uvmfault_relock(ufi);
+               vm_map_unbusy(ufi->map);
                if (locked && amap != NULL)
                        amap_lock(amap);
 

Reply via email to