On Friday, April 9, 2021 at 10:37:55 AM UTC-4 Waldek Kozaczuk wrote:

> On Thu, Apr 8, 2021 at 11:15 AM Waldek Kozaczuk <[email protected]> 
> wrote:
>
>> On Wed, Apr 7, 2021 at 10:56 AM Waldek Kozaczuk <[email protected]> 
>> wrote:
>>
>>>
>>>
>>> On Wed, Apr 7, 2021 at 01:36 Waldek Kozaczuk <[email protected]> wrote:
>>>
>>>> I think you are right about wait_record and memory barriers (unless 
>>>> there is still some hole in our thinking :-)).
>>>>
>>>> So let us get back to one of the observations I made and the related 
>>>> question I posed here - 
>>>> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-803710337
>>>> . 
>>>>
>>>> The thread::wake_impl() method has deliberate "if" logic to validate 
>>>> that the current (old) status of the thread is one of the initial states 
>>>> allowed by the specified *allowed_initial_states_mask* argument. If it is 
>>>> NOT, wake_impl() simply returns without actually waking the thread 
>>>> (setting need_reschedule to true or calling send_wakeup_ipi()). It is 
>>>> interesting that neither wake_impl() nor wake() returns a result of the 
>>>> "waking process" - the return type is void. I wonder what the rationale 
>>>> behind that was. Maybe in most cases it does not matter whether we really 
>>>> wake it or not. But maybe in some it does, and we should be able to know 
>>>> that the thread was not really woken.
>>>>
>>>> Normally the thread that wake_impl() is called on (per the st argument) 
>>>> would most likely be in the *waiting* state. But clearly, that does not 
>>>> always have to be the case - the same thread could have just been woken by 
>>>> another thread on another CPU, for example. 
>>>>
>>>> I have added these debug printouts to see what states a thread might be 
>>>> in when wake_impl() is called:
>>>>
>>>> void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>>>> {
>>>>     status old_status = status::waiting;
>>>>     trace_sched_wake(st->t);
>>>>     while (!st->st.compare_exchange_weak(old_status, status::waking)) {
>>>>         if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
>>>>             if (allowed_initial_states_mask == ((1 << unsigned(status::waiting)) | (1 << unsigned(status::sending_lock)))) {
>>>>                 if (old_status != status::waiting && old_status != status::sending_lock) {
>>>>                     debug_early_u64("! wake_impl: ", (unsigned)old_status);
>>>>                 }
>>>>             }
>>>>             return;
>>>>         }
>>>>     }
>>>>
>>>> Please note I am specifically logging only cases where wake_impl() is 
>>>> called from thread::wake_with_from_mutex(Action action). With one CPU, I 
>>>> occasionally see "queued" logged. With 2 CPUs, besides "queued", I also 
>>>> occasionally see "waking" and "running", which is interesting. Most of 
>>>> the time, even when I see these "waking" and "running" printouts, OSv does 
>>>> NOT hang. But sometimes it hangs right after.
>>>>
>>> I think that "queued" is not an issue, as this thread is in the runqueue 
>>> and will eventually be running. With a single CPU the "running" case is 
>>> impossible (only one thread can be running at a time) and "waking" is also OK. 
>>>
>>> I wonder if step 2 below is ever possible or should be?
>>>
>>> Or maybe we have some memory ordering issue with the wait list, where popping a 
>>> waiter is not visible on the other CPU and it is still in that list, or maybe we 
>>> have duplicates of the same thread. 
>>>
>>>>
>>>> Now if we focus on the lockless mutex (core/lfmutex.cc) and its two key methods 
>>>> - lock() and unlock() - I think we can imagine the following scenario:
>>>>
>>>> 1. Thread T1 on CPU0 calls lock() on some lock-free mutex M and ends up 
>>>> creating a wait_record "waiter" on the stack (could NOT acquire the lock), 
>>>> adds "waiter" to the M waitqueue, and finally calls waiter.wait() which 
>>>> ends up calling sched::do_wait_until(). Eventually T1 gets scheduled out 
>>>> and becomes "waiting".
>>>> 2. Later, thread T2 on CPU0 calls thread::wake(), or some other code path 
>>>> not related to the lock-free mutex calls wake_impl(), to wake thread 
>>>> T1 (I am not sure what such a scenario might be).
>>>> 3a. Thread T1 becomes running again on CPU0, but it checks the 
>>>> condition of the waiter, which obviously is NOT true (!t), and it goes back 
>>>> to "waiting".
>>>> 3b. Almost at the same time as T1 is running per 3a on CPU0, thread T3 
>>>> on CPU1 calls unlock() on the same mutex M, pops the waiter from step 1 as 
>>>> part of the "Responsibility Hand-Off" protocol, and calls wake() (assuming it 
>>>> will simply wake the thread). But in the end, the wake_impl() "if" 
>>>> determines that T1 is running, so it never actually wakes it, and T1 stays 
>>>> stuck like this (no?) unless some other thread later simply wakes it again 
>>>> (t was set to null in the waiter). But what if such a thread never comes to 
>>>> wake it?
>>>>
>>>> Am I misunderstanding something or describing a scenario that should 
>>>> never happen? Or maybe it is a symptom of another problem?
>>>>
>>>> If my thinking is correct, then:
>>>> 1. Shouldn't we check whether wake_impl() actually worked all the way in the 
>>>> lock-free mutex code, and push the waiter back onto its waitqueue if it 
>>>> did not? Or is it even more complicated, and should we iterate to find a 
>>>> waiter it can actually wake?
>>>> 2. Shouldn't the implementation of thread::do_wake_with(Action action, 
>>>> unsigned allowed_initial_states_mask) change to call action ONLY if 
>>>> wake_impl() actually woke the thread (meaning the while loop succeeded)? We 
>>>> would need another wake_impl() that would take action as an argument.
>>>>  
>>>> If that is indeed a real problem, why does it not happen on Intel?
>>>>
>>>
>> I have applied the same patch and ran the same tests, and I could see the 
>> same occasional "waking" and "running" printouts. But on x64 it never 
>> hangs. So maybe the scenarios where wake_impl() fails to change the state 
>> to waking are legitimate cases that are correctly handled, but I still 
>> do not understand how.
>>
>> So back to the drawing board. I still think we are missing some memory 
>> barrier instructions somewhere. See below.
>>  
>>
>>>
>>>> On Sunday, April 4, 2021 at 12:32:04 PM UTC-4 Nadav Har'El wrote:
>>>>
>>>>> On Sun, Apr 4, 2021 at 6:00 PM Waldek Kozaczuk <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Thursday, April 1, 2021 at 12:36:19 PM UTC-4 Nadav Har'El wrote:
>>>>>>
>>>>>>> On Fri, Mar 26, 2021 at 6:10 AM Waldek Kozaczuk <[email protected]> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> As I have been researching a bit the SMP issue described here - 
>>>>>>>> https://github.com/cloudius-systems/osv/issues/1123 - I have 
>>>>>>>> noticed that the 't' variable in 
>>>>>>>> https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33
>>>>>>>>  
>>>>>>>> might be related to why OSv hangs with two or more CPUs.
>>>>>>>>
>>>>>>>> As I am describing here - 
>>>>>>>> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270
>>>>>>>>  
>>>>>>>> - writing to the thread pointer variable by thread T1 on cpu0, may not 
>>>>>>>> be 
>>>>>>>> visible to thread T2 on cpu1 in the weak-memory model scenario. 
>>>>>>>>
>>>>>>>
>>>>>>> This is an interesting question. Indeed, waiter::t is not marked 
>>>>>>> atomic, and it's been a very long time since I wrote this code. Note that 
>>>>>>> "t" is not public, and the only methods that access t are wake() and 
>>>>>>> wait().
>>>>>>>
>>>>>>> I *think* (if I remember correctly) my thinking was that each of 
>>>>>>> these wake() and wait() operations already has its *own* memory barriers: 
>>>>>>> namely, wake_impl() and prepare_wait() both access the state of the thread 
>>>>>>> with CST ordering - 
>>>>>>> https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering.
>>>>>>> I think (again, I hope I remember correctly) we were hoping that these 
>>>>>>> barriers also apply to the t accesses before or after them. 
>>>>>>>
>>>>>>> The thinking was that it is similar to how one uses a shared 
>>>>>>> variable protected by a mutex: you don't need the shared variable to be 
>>>>>>> std::atomic or to care about memory order or barriers - the mutex already 
>>>>>>> takes care of that for you.
>>>>>>>
>>>>>>
>>>>>> I think that is true on Intel but not necessarily on ARM. 
>>>>>> Specifically, in the SMP case, the changes made to that shared variable on 
>>>>>> CPU 1, even within a mutex critical section, are not visible to CPU 2 because 
>>>>>> the changes made by the former will still be in the L1 cache. Only when we 
>>>>>> use proper memory barrier instructions (and I have figured out which ones to 
>>>>>> use and where exactly) OR std::atomic<> (which on ARM translates to the 
>>>>>> *LDAR/STLR instructions* that by nature make the changes to memory atomic and 
>>>>>> visible to all CPUs) are the changes to _t visible.
>>>>>>
>>>>>
>>>>> I don't think this is the case. Multi-threaded code which uses 
>>>>> traditional Posix mutexes does *not* use any std::atomic or barriers. People 
>>>>> wrote such code before C++11's std::atomic, and without barriers in 
>>>>> assembly language (again, I'm talking about Posix threads code with mutexes 
>>>>> - not about lock-free code).
>>>>> The assumption is that the *implementation* of the mutex uses 
>>>>> these barriers: "acquire" and "release" barriers, so that, for example, 
>>>>> changes that one thread makes to a shared variable inside a critical section 
>>>>> protected by a mutex will be visible to some other thread on another CPU 
>>>>> which later takes the same mutex.
>>>>>  
>>>>> The wait_record.hh wait()/wake() do NOT use a mutex (they are used in 
>>>>> the mutex implementation ;-)) but they do have barriers as I'll explain 
>>>>> below:
>>>>>
>>>>>
>>>>>> Otherwise the wait() method in wait_record.hh will not see _t equal 
>>>>>> to null when the thread waiting on CPU2 is woken. So the effect is that the 
>>>>>> interrupt arrives, the thread is woken, but it does not see the _t value equal to 0. 
>>>>>> But I might be wrong in my reasoning.
>>>>>>
>>>>>>
>>>>>>> But you're right that it's hard to catch bugs in this memory 
>>>>>>> ordering on x86 and it's easier to do it wrong on ARM.
>>>>>>>
>>>>>>> The question is what exactly is wrong:
>>>>>>>
>>>>>>>    1. Is the ordering in wake_impl() and prepare_wait() (called by 
>>>>>>>    waiter::wake() and waiter::wait()) not strong enough on ARM to 
>>>>>>> serialize 
>>>>>>>    *everything*, not just a specific variable?
>>>>>>>
>>>>>>> I think this is specific to _t. 
>>>>>>
>>>>>>>
>>>>>>>    2. Do we have this serialization in the right place? E.g., maybe 
>>>>>>>    we have the ordering point before writing t but we need it after writing t, 
>>>>>>>    and so on.
>>>>>>>    3. Maybe the problem isn't in waiter::wake() or waiter::wait() 
>>>>>>>    but in other functions like waiter::woken() or waiter::clear()?
>>>>>>>
>>>>>>> I suggest that you change waiter.hh to use std::atomic for t, and 
>>>>>>> then use "relaxed" ordering to read/write t in some of the waiter 
>>>>>>> methods and "cst" ordering in the others - and see which method needs 
>>>>>>> the non-relaxed ordering. Once we know which specific method is missing 
>>>>>>> the ordering, we can try to understand why it is missing.
>>>>>>>
>>>>>> Which specific place in the code do you have in mind? Nothing in 
>>>>>> wait_record.hh uses any memory-ordering-related methods.
>>>>>>
>>>>>
>>>>> *wait_record::wait()* calls sched::thread::wait_until(). This calls 
>>>>> do_wait_until(), which starts by creating a "wait_guard". This calls 
>>>>> thread::prepare_wait(), which calls
>>>>>    * _detached_state->st.store(status::waiting);*
>>>>> Because there is no memory-ordering parameter, it uses the default 
>>>>> memory_order_seq_cst ordering.
>>>>> The thinking was that this does all the necessary memory barriers.
>>>>>
>>>>> On the other side, *wait_record::wake()* calls 
>>>>> thread::wake_with_from_mutex(), which calls do_wake_with(), which does
>>>>>         *auto ds = _detached_state.get();*
>>>>> This get() uses memory_order_seq_cst, and so it is the other side of 
>>>>> the store barrier we had in wait().
>>>>>
Regarding your explanation about thread::prepare_wait() and 
do_wake_with(): the more I read about using acquire/release to make changes 
to a shared variable synced and visible between threads, the more I wonder - 
shouldn't this be reversed? If you look at do_wake_with(), it calls action() after "auto 
ds = _detached_state.get();", which is an acquire operation, right?

But normally the order is this:
1) some thread trying to acquire a lock calls *wait_record::wait()*, which 
then effectively loops in thread::do_wait_until() until it sees t null, 
right?
2) another thread calls *wait_record::wake()*, which then ultimately sets t 
to null and calls wake_impl()

So given that, shouldn't we have the acquire operation in the 
*wait_record::wait()* chain and the release operation in do_wake_with() 
after we call the action in wake_impl()? 

In theory, prepare_wait() has this line:

assert(_detached_state->st.load() == status::running); - the acquire 
operation

followed by the store, which is the release operation.


And wake_impl() has the loop calling st.compare_exchange_weak(), which I 
think is a release operation if it succeeds. So maybe we are OK, except when 
we break from that loop and return because the thread state was neither 
waiting nor sending_lock.

My atomic_thread_fence() calls try to fill those missing gaps, I think.


>>>>>
>>>>> Again, this was the thinking, or at least what I remember from many 
>>>>> years ago when I was an expert in this code ;-)
>>>>> Maybe there is something subtly wrong with this thinking on ARM, while 
>>>>> it was correct for x86....
>>>>>
>> So you are right about these memory barriers in prepare_wait() 
>> and do_wake_with(), but I may have found another very related place where we 
>> are missing some. 
>>
>> In thread::do_wait_until() we do use a wait_guard that calls 
>> prepare_wait(). However, prepare_wait() uses store(), which perhaps only 
>> issues a *store* memory barrier instruction. And wait_until() checks the 
>> shared variable - t in this case. For t, changed by the action in 
>> do_wake_with(), are we not missing a load barrier before checking the predicate?
>>
>> Relatedly, as you can see above, wake_impl() may fail to change the st 
>> variable to waking for the reasons described above. In that case, perhaps 
>> the compare_exchange_weak() call that returns false does NOT issue a store 
>> barrier, does it? So maybe in this case, before we return from that while loop 
>> in wake_impl(), we should also issue a memory barrier instruction?
>>
>> I conducted experiments where I added a generic memory barrier 
>> instruction - dmb ish - to both prepare_wait() and wake_impl() in the 
>> places described, and I could see my hello world execute 1-2K times before 
>> hanging in some other place. So not as good as using an atomic for t, but these 
>> extra memory barriers definitely help.
>>
>> @@ -1143,6 +1164,7 @@ void thread::prepare_wait()
>>      preempt_disable();
>>      assert(_detached_state->st.load() == status::running);
>>      _detached_state->st.store(status::waiting);
>> +    asm volatile ("dmb ish");
>>  }
>>
> One thing I missed is that prepare_wait() calls both load() and then 
> store() on _detached_state->st, which is interesting, as it means that all 
> memory operations - both loads and stores - should be ordered BEFORE and 
> AFTER. This would indicate that "asm volatile ("dmb ish");" (a 
> general memory barrier instruction that waits for both the store and load 
> instructions to complete) is *NOT* necessary. But when I remove it, I get 
> back the same hangs.
>
> Maybe it is because when I use "asm volatile ("dmb ishst");" in 
> wake_impl(), which is NOT specific to _detached_state->st, the rules of 
> memory barriers do not apply? Meaning the load() has to match a store() of 
> the *SAME* atomic variable, no?
>
> For example, this statement from C++ memory ordering documentation seems 
> to indicate:
>
> "memory_order_release - A store operation with this memory order performs 
> the release operation: no reads or writes in the current thread can be 
> reordered after this store. All writes in the current thread are visible in 
> other threads that *acquire the same atomic variable* (see 
> Release-Acquire ordering below) and writes that carry a dependency into the 
> atomic variable become visible in other threads* that consume the same 
> atomic* (see Release-Consume ordering below). "
>

The more I dig into atomics and acquire-release semantics, the more I learn 
that indeed load()/store() (and their derivatives) need to be called on the 
same atomic variable.

>
> So maybe because we use a non-atomic barrier instruction in wake_impl(), we 
> have to use a similar one in prepare_wait() to make sure that changes to the 
> shared variable *t* are visible between the 2 threads.
>
>> @@ -1192,6 +1214,13 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>>      trace_sched_wake(st->t);
>>      while (!st->st.compare_exchange_weak(old_status, status::waking)) {
>>          if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
>> +            asm volatile ("dmb ishst");
>>              return;
>>          }
>>
The C++ equivalent to the asm instructions would be this:

@@ -1143,6 +1164,8 @@ void thread::prepare_wait()
     preempt_disable();
     assert(_detached_state->st.load() == status::running);
     _detached_state->st.store(status::waiting);
+    std::atomic_thread_fence(std::memory_order_acquire);
 }
 
 // This function is responsible for changing a thread's state from
@@ -1192,6 +1215,15 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
     trace_sched_wake(st->t);
     while (!st->st.compare_exchange_weak(old_status, status::waking)) {
         if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
+            std::atomic_thread_fence(std::memory_order_release);
             return;
         }
     }
@@ -1209,6 +1241,7 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
 


-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/d85532fd-f013-4184-b41e-9ca7917d5e40n%40googlegroups.com.
