Hi Herton,

On 08/10/2015 05:31 PM, Herton R. Krzesinski wrote:
Well, without the synchronize_rcu() and with the semid list loop fix, I was still
able to get issues, and I thought the problem was related to racing with IPC_RMID
in freeary() again. This is one scenario I would imagine:

                A                                                  B

freeary()
   list_del(&un->list_id)
   spin_lock(&un->ulp->lock)
   un->semid = -1
   list_del_rcu(&un->list_proc)
     __list_del_entry(&un->list_proc)
       __list_del(entry->prev, entry->next)      exit_sem()
         next->prev = prev;                        ...
         prev->next = next;                        ...
         ...                                       un = list_entry_rcu(ulp->list_proc.next...)
     (&un->list_proc)->prev = LIST_POISON2         if (&un->list_proc == &ulp->list_proc) <true, last un removed by thread A>
   ...                                             kfree(ulp)
   spin_unlock(&un->ulp->lock) <---- bug
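
The window in the scenario above can be replayed in a single thread. What
follows is only a user-space sketch, not kernel code: struct undo_list,
struct undo, list_unlink(), looks_empty() and demo_window() are hypothetical
stand-ins reduced to the fields the diagram uses, and the "lock" is a plain
flag. It only shows that, right after the two stores of __list_del(), the
emptiness check done on the exit_sem() side already succeeds while the
freeary() side still holds the lock and has yet to touch it again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

static void list_add(struct list_head *new, struct list_head *head)
{
    new->next = head->next;
    new->prev = head;
    head->next->prev = new;
    head->next = new;
}

/* The same two stores that __list_del() performs in the diagram. */
static void list_unlink(struct list_head *prev, struct list_head *next)
{
    next->prev = prev;
    prev->next = next;
}

/* Stand-ins for sem_undo_list and sem_undo; "locked" models ulp->lock. */
struct undo_list { bool locked; struct list_head list_proc; };
struct undo { int semid; struct list_head list_proc; struct undo_list *ulp; };

/* Emptiness check as seen by thread B: first entry is the head itself. */
static bool looks_empty(const struct undo_list *ulp)
{
    return ulp->list_proc.next == &ulp->list_proc;
}

/* Returns true if the list looks empty while the lock is still held. */
static bool demo_window(void)
{
    struct undo_list ulp = { .locked = false };
    struct undo un = { .semid = 42, .ulp = &ulp };
    bool window;

    list_init(&ulp.list_proc);
    list_add(&un.list_proc, &ulp.list_proc);
    assert(!looks_empty(&ulp));

    /* Thread A, freeary() side. */
    ulp.locked = true;                     /* spin_lock(&un->ulp->lock) */
    un.semid = -1;
    list_unlink(un.list_proc.prev, un.list_proc.next);

    /* What thread B, exit_sem() side, would observe at this point. */
    window = looks_empty(&ulp) && ulp.locked;

    /* Thread A's spin_unlock(): still dereferences ulp, which B
     * could already have passed to kfree() in the real code. */
    ulp.locked = false;
    return window;
}
```

Calling demo_window() returns true here, i.e. thread B can decide "list is
empty, free ulp" before thread A has released the lock that lives inside ulp.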

Now that is a very tight window, but I had problems even when I tried this patch
first:

(...)
-               if (&un->list_proc == &ulp->list_proc)
-                       semid = -1;
-                else
-                       semid = un->semid;
+               if (&un->list_proc == &ulp->list_proc) {
+                       rcu_read_unlock();

What about:
+                       spin_unlock_wait(&ulp->lock);

+                       break;
+               }
+               spin_lock(&ulp->lock);
+               semid = un->semid;
+               spin_unlock(&ulp->lock);

+               /* exit_sem raced with IPC_RMID, nothing to do */
                 if (semid == -1) {
                         rcu_read_unlock();
-                       break;
+                       synchronize_rcu();
+                       continue;
                 }
(...)

So even with the bad/unneeded synchronize_rcu() which I had placed there, I could
still get issues. (However, the testing of the patch above was on an older kernel
than latest upstream, from RHEL 6. I can test without synchronize_rcu() on latest
upstream, though the affected code is the same.) That's when I thought of the
scenario above. I was able to get this oops:

Adding sleep() usually helps, too. But it is ugly, so let's try to understand the race and fix it.

Best regards,
    Manfred