The more I read about the C++11 memory model and atomics, the more I suspect 
that we are missing the "synchronizes-with" relationships needed for changes 
to _t in wait_record (and maybe other places) to be visible between threads 
on different CPUs.

More specifically, wake_impl(), called from thread::do_wake_with() AFTER the 
action runs (_t = nullptr), may break out of the loop and therefore never 
store a new value of *_detached_state->st*. In that scenario, there is 
potentially no "release" operation to pair with a corresponding "acquire" 
load of *_detached_state->st* on the other CPU.

I think the only way to provide such a "synchronizes-with" relationship for 
_t is to make it a std::atomic and use release/acquire memory ordering for 
the corresponding store()/load() calls in wait_record.hh.
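To make this concrete, here is a minimal sketch (hypothetical names and shape 
only; the real wait_record lives in include/osv/wait_record.hh and stores a 
sched::thread*):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical sketch - not the actual OSv wait_record. _t becomes a
// std::atomic; the waker clears it with a release store, and the waiter
// polls it with an acquire load, so the two sides synchronize even on
// weakly ordered CPUs such as ARM.
struct waiter_sketch {
    std::atomic<void*> _t;   // stands in for sched::thread* _t
    int payload = 0;         // plain write published by the release store

    explicit waiter_sketch(void* t) : _t(t) {}

    void wake() {
        payload = 42;                                  // plain write...
        _t.store(nullptr, std::memory_order_release);  // ...published here
    }

    bool woken() const {
        // pairs with the release store above on the SAME atomic object
        return _t.load(std::memory_order_acquire) == nullptr;
    }
};
```

A thread that observes woken() == true is then also guaranteed to see any 
plain writes made before the release store.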

On Saturday, April 10, 2021 at 11:44:28 PM UTC-4 Waldek Kozaczuk wrote:

> On Friday, April 9, 2021 at 10:37:55 AM UTC-4 Waldek Kozaczuk wrote:
>
>> On Thu, Apr 8, 2021 at 11:15 AM Waldek Kozaczuk <[email protected]> 
>> wrote:
>>
>>> On Wed, Apr 7, 2021 at 10:56 AM Waldek Kozaczuk <[email protected]> 
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Apr 7, 2021 at 01:36 Waldek Kozaczuk <[email protected]> 
>>>> wrote:
>>>>
>>>>> I think you are right about wait_record and memory barriers (unless 
>>>>> there is still some hole in our thinking :-)).
>>>>>
>>>>> So let us get back to one of the observations I made and the related 
>>>>> question I posed here - 
>>>>> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-803710337
>>>>> . 
>>>>>
>>>>> The thread::wake_impl() method has deliberate "if" logic to validate 
>>>>> that the current (old) status of the thread is one of the initial states 
>>>>> allowed by the specified *allowed_initial_states_mask* argument. If it is 
>>>>> NOT, wake_impl() simply returns without actually waking the thread 
>>>>> (setting need_reschedule to true or calling send_wakeup_ipi()). It is 
>>>>> interesting that neither wake_impl() nor wake() returns a result of the 
>>>>> "waking process" - the return type is void. I wonder what the rationale 
>>>>> behind that was. Maybe in most cases it does not matter whether we really 
>>>>> woke the thread. But maybe in some it does, and we should be able to know 
>>>>> that the thread was not really woken.
>>>>>
>>>>> Normally the thread that wake_impl() is called on (per the st argument) 
>>>>> would most likely be in the *waiting* state. But clearly, that does not 
>>>>> always have to be the case - the same thread could have just been woken by 
>>>>> another thread on another CPU, for example. 
>>>>>
>>>>> I have added these debug printouts to see what states the thread might be 
>>>>> in when wake_impl() is called:
>>>>>
>>>>> void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>>>>> {
>>>>>     status old_status = status::waiting;
>>>>>     trace_sched_wake(st->t);
>>>>>     while (!st->st.compare_exchange_weak(old_status, status::waking)) {
>>>>>         if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
>>>>>             if (allowed_initial_states_mask == ((1 << unsigned(status::waiting)) | (1 << unsigned(status::sending_lock)))) {
>>>>>                 if (old_status != status::waiting && old_status != status::sending_lock) {
>>>>>                     debug_early_u64("! wake_impl: ", (unsigned)old_status);
>>>>>                 }
>>>>>             }
>>>>>             return;
>>>>>         }
>>>>>     }
>>>>>
>>>>> Please note I am specifically logging only cases when wake_impl() is 
>>>>> called from thread::wake_with_from_mutex(Action action). With one CPU, I 
>>>>> occasionally see that "queued" is logged. With 2 CPUs, besides "queued", I 
>>>>> also occasionally see "waking" and "running", which is interesting. Most of 
>>>>> the time, even if I see these "waking" and "running" printouts, OSv would 
>>>>> NOT hang. But sometimes it would, right after.
>>>>>
>>>> I think that "queued" is not an issue, as this thread is in the runqueue 
>>>> and will eventually be running. With a single CPU the "running" case is 
>>>> impossible (only one thread can be running at a time), and "waking" is also OK. 
>>>>
>>>> I wonder if step 2 below is ever possible, or should be?
>>>>
>>>> Or maybe we have some memory-ordering issue with the waitlist, where popping 
>>>> a waiter is not visible on the other cpu and it is still in that list, or 
>>>> maybe we have duplicates of the same thread. 
>>>>
>>>>>
>>>>> Now, if we focus on the lockless mutex (core/lfmutex.cc) and its two key 
>>>>> methods - lock() and unlock() - I think we can imagine the following scenario:
>>>>>
>>>>> 1. Thread T1 on CPU0 calls lock() on some lock-free mutex M and, having 
>>>>> failed to acquire the lock, ends up creating a wait_record "waiter" on the 
>>>>> stack, adds "waiter" to M's waitqueue, and finally calls waiter.wait(), 
>>>>> which ends up calling sched::do_wait_until(). Eventually T1 gets scheduled 
>>>>> out and becomes "waiting".
>>>>> 2. Later, thread T2 on CPU0 calls thread::wake() or some other code path 
>>>>> unrelated to the lock-free mutex that calls wake_impl() to wake thread 
>>>>> T1 (not sure what such a scenario might be).
>>>>> 3a. Thread T1 becomes running again on CPU0, but it checks the 
>>>>> condition of the waiter, which obviously is NOT true (!t), and it goes back 
>>>>> to "waiting".
>>>>> 3b. Almost at the same time as T1 is running per 3a on CPU0, thread T3 
>>>>> on CPU1 calls unlock() on the same mutex M, pops the waiter from 1) as 
>>>>> part of the "Responsibility Hand-Off" protocol, and calls wake() (assuming 
>>>>> it will simply wake the thread). But in the end, the wake_impl() "if" 
>>>>> determines that T1 is running, so it never actually wakes it, and T1 stays 
>>>>> stuck like this (no?) unless some other thread later simply wakes it again 
>>>>> (t was set to null in the waiter). But what if such a thread never comes to 
>>>>> wake it?
>>>>>
>>>>> Am I misunderstanding something, or describing a scenario that should 
>>>>> never happen? Or maybe it is a symptom of another problem?
>>>>>
>>>>> If my thinking is correct, then:
>>>>> 1. Shouldn't we check in the lock-free mutex code whether wake_impl() 
>>>>> actually worked all the way, and push the waiter back onto its waitqueue 
>>>>> if it did not? Or is it even more complicated, and we should iterate to 
>>>>> find a waiter it can actually wake?
>>>>> 2. Shouldn't the implementation of thread::do_wake_with(Action action, 
>>>>> unsigned allowed_initial_states_mask) change to call action ONLY if 
>>>>> wake_impl() actually woke the thread (meaning the while loop succeeded)? 
>>>>> We would need another wake_impl() that would take action as an argument.
>>>>>  
>>>>> If that is indeed a real problem, why does it not happen on Intel?
>>>>>
>>>>
>>> I have applied the same patch and ran the same tests, and I could see 
>>> the same occasional "waking" and "running" printouts. But on x64 it never 
>>> hangs. So maybe the scenarios where wake_impl() fails to change the 
>>> state to waking are legitimate cases that are correctly handled, but I still 
>>> do not understand how.
>>>
>>> So back to the drawing board. I still think we are missing some memory 
>>> barrier instructions somewhere. See below.
>>>  
>>>
>>>>
>>>>> On Sunday, April 4, 2021 at 12:32:04 PM UTC-4 Nadav Har'El wrote:
>>>>>
>>>>>> On Sun, Apr 4, 2021 at 6:00 PM Waldek Kozaczuk <[email protected]> 
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, April 1, 2021 at 12:36:19 PM UTC-4 Nadav Har'El wrote:
>>>>>>>
>>>>>>>> On Fri, Mar 26, 2021 at 6:10 AM Waldek Kozaczuk <[email protected]> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> As I have been researching a bit the SMP issue described here - 
>>>>>>>>> https://github.com/cloudius-systems/osv/issues/1123 - I have 
>>>>>>>>> noticed that the 't' variable in 
>>>>>>>>> https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33
>>>>>>>>>  
>>>>>>>>> might be related to why OSv hangs with two or more CPUs.
>>>>>>>>>
>>>>>>>>> As I am describing here - 
>>>>>>>>> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270
>>>>>>>>>  
>>>>>>>>> - writing to the thread pointer variable by thread T1 on cpu0, may 
>>>>>>>>> not be 
>>>>>>>>> visible to thread T2 on cpu1 in the weak-memory model scenario. 
>>>>>>>>>
>>>>>>>>
>>>>>>>> This is an interesting question. Indeed, waiter::t is not 
>>>>>>>> marked atomic, and it's been a very long time since I wrote this code. 
>>>>>>>> Note that "t" is not public, and the only methods that access t are 
>>>>>>>> wake() and wait().
>>>>>>>>
>>>>>>>> I *think* (if I remember correctly) my thinking was that each of 
>>>>>>>> these wake() and wait() operations already has its *own* memory barriers: 
>>>>>>>> namely, wake_impl() and prepare_wait() both access the state of the thread 
>>>>>>>> with CST ordering - 
>>>>>>>> https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering. 
>>>>>>>> I think (again, I hope I remember correctly) we were hoping that these 
>>>>>>>> barriers also apply to the t accesses before or after them. 
>>>>>>>>
>>>>>>>> The thinking was that it is similar to how one uses a shared 
>>>>>>>> variable protected by a mutex: you don't need the shared variable to be 
>>>>>>>> std::atomic, or to care about memory order or barriers - the mutex 
>>>>>>>> already takes care of that for you.
>>>>>>>>
>>>>>>>
>>>>>>> I think that is true on Intel but not necessarily on ARM. 
>>>>>>> Specifically, in the SMP case, the changes made to that shared variable on 
>>>>>>> CPU 1, even within a mutex critical section, are not visible to CPU 2, 
>>>>>>> because the changes made by the former will still be in the L1 cache. Only 
>>>>>>> when we use proper memory barrier instructions (and I have figured out 
>>>>>>> which ones to use and where exactly) OR std::atomic<> (which on ARM 
>>>>>>> translates to the *LDAR/STLR instructions* that by nature make the changes 
>>>>>>> to memory atomic and visible to all CPUs) do the changes to _t become 
>>>>>>> visible.
>>>>>>>
>>>>>>
>>>>>> I don't think this is the case. Multi-threaded code which uses a 
>>>>>> traditional Posix mutex does *not* use any std::atomic or barriers. People 
>>>>>> wrote such code before C++11's std::atomic, and without barriers in 
>>>>>> assembly language (again, I'm talking about Posix threads code with mutexes 
>>>>>> - not about lock-free code).
>>>>>> The assumption is that the *implementation* of the mutex is using 
>>>>>> these barriers: "acquire" and "release" barriers, so that, for example, 
>>>>>> changes that one thread does to a shared variable inside a critical section 
>>>>>> protected by a mutex will be visible to some other thread on another CPU 
>>>>>> which later takes the same mutex.
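>>>>>> To illustrate the point (a minimal sketch with made-up names, not OSv 
>>>>>> code): a plain, non-atomic counter shared under a mutex stays correct, 
>>>>>> because lock()/unlock() themselves carry the acquire/release barriers:

```cpp
#include <cassert>
#include <mutex>
#include <thread>

std::mutex m;
long counter = 0;            // deliberately NOT std::atomic

void add(int n) {
    for (int i = 0; i < n; i++) {
        std::lock_guard<std::mutex> g(m);  // acquire barrier on lock
        ++counter;                         // plain read-modify-write
    }                                      // release barrier on unlock
}
```

>>>>>> Two threads each calling add(10000) will reliably leave counter at 
>>>>>> 20000, with no std::atomic in sight.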
>>>>>>  
>>>>>> The wait_record.hh wait()/wake() do NOT use a mutex (they are used in 
>>>>>> the mutex implementation ;-)) but they do have barriers as I'll explain 
>>>>>> below:
>>>>>>
>>>>>>
>>>>>>> Otherwise the wait() method in wait_record.hh will not see _t equal 
>>>>>>> to null when the thread waiting on CPU2 is woken. So the effect is that 
>>>>>>> the interrupt arrives and the thread is woken, but it does not see the _t 
>>>>>>> value equal to 0. But I might be wrong in my reasoning.
>>>>>>>
>>>>>>>
>>>>>>>> But you're right that it's hard to catch bugs in this memory 
>>>>>>>> ordering on x86 and it's easier to do it wrong on ARM.
>>>>>>>>
>>>>>>>> The question is what exactly is wrong:
>>>>>>>>
>>>>>>>>    1. Is the ordering in wake_impl() and prepare_wait() (called by 
>>>>>>>>    waiter::wake() and waiter::wait()) not strong enough on ARM to 
>>>>>>>> serialize 
>>>>>>>>    *everything*, not just a specific variable?
>>>>>>>>
>>>>>>>> I think this is specific to _t. 
>>>>>>>
>>>>>>>>
>>>>>>>>    2. Do we have this serialization in the right place? E.g., 
>>>>>>>>    maybe we have the ordering point before writing t but we need it after 
>>>>>>>>    writing t, and so on.
>>>>>>>>    3. Maybe the problem isn't in waiter::wake() or waiter::wait() 
>>>>>>>>    but in other functions like waiter::woken() or waiter::clear()?
>>>>>>>>
>>>>>>>> I suggest that you change waiter.hh to use std::atomic for t, and 
>>>>>>>> then use "relaxed" ordering to read/write t in some of the waiter 
>>>>>>>> methods and "cst" ordering in the other methods - and see which method 
>>>>>>>> needs the non-relaxed ordering. When we know which specific method is 
>>>>>>>> missing the ordering, we can try to understand why it is missing.
>>>>>>>>
>>>>>>> Which specific place in the code do you have in mind? Nothing in 
>>>>>>> wait_record.hh uses any memory-ordering-related methods.
>>>>>>>
>>>>>>
>>>>>> *wait_record::wait()* calls sched::thread::wait_until(). This calls 
>>>>>> do_wait_until(), which starts by creating a "wait_guard". This calls 
>>>>>> thread::prepare_wait(), which calls
>>>>>>    * _detached_state->st.store(status::waiting);*
>>>>>> Because there is no memory-ordering parameter, it uses the default 
>>>>>> memory_order_seq_cst ordering.
>>>>>> The thinking was that this does all the necessary memory barriers.
>>>>>>
>>>>>> On the other side, *wait_record::wake()* calls 
>>>>>> thread::wake_with_from_mutex(), which calls do_wake_with(), which does
>>>>>>         *auto ds = _detached_state.get();*
>>>>>> this get() uses memory_order_seq_cst, and so it is the other side of 
>>>>>> the store barrier we had in wait().
>>>>>>
> Regarding your explanation about thread::prepare_wait() and 
> do_wake_with(): the more I read about using acquire/release to make changes 
> to some shared variable synced and visible between threads, shouldn't this 
> be reversed? If you look at do_wake_with(), it calls action() after "auto 
> ds = _detached_state.get();", which is the acquire operation, right?
>

The _detached_state is a std::unique_ptr<detached_state>, so calling get() 
does not provide any load barrier. I think you may have meant another place 
where load() is called on the atomic *_detached_state->st*. 
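To make the intended pairing concrete, here is a reduced, hypothetical model 
of the pattern in question - all names are made up and only the shape mirrors 
sched.cc: the seq_cst operations on the one atomic status variable are what 
would publish a plain write to a second, non-atomic field (standing in for _t):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

enum class status { running, waiting, waking };

struct model {
    std::atomic<status> st{status::running};
    void* t = this;                     // plain field, like wait_record::_t

    void prepare_wait() {               // waiter side
        st.store(status::waiting);      // default = memory_order_seq_cst
    }

    void wake() {                       // waker side
        t = nullptr;                    // plain write (the "action")
        status expected = status::waiting;
        // a successful seq_cst CAS acts as both acquire and release, so the
        // plain write to t above is published together with it
        st.compare_exchange_strong(expected, status::waking);
    }

    bool woken() const {
        // observing 'waking' via a seq_cst load means the write to t is
        // also visible to this thread
        return st.load() == status::waking && t == nullptr;
    }
};
```

Note that the publication only works because the plain write to t happens 
before the successful CAS on the same atomic st that the reader later loads.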


> But normally the order is this:
> 1) some thread trying to acquire a lock calls *wait_record::wait()*, which 
> then effectively loops in thread::do_wait_until() until it sees t null, 
> right?
> 2) another thread calls *wait_record::wake()*, which then ultimately sets t 
> to null and calls wake_impl()
>
> So given that, shouldn't we have the acquire operation in the 
> *wait_record::wait()* chain and the release operation in do_wake_with(), 
> after we call the action in wake_impl()? 
>
> In theory, prepare_wait() has this line:
>
> assert(_detached_state->st.load() == status::running); - the acquire 
> operation
>
> followed by the store, which is the release.
>
>
> And wake_impl() has the loop calling st.compare_exchange_weak(), which 
> I think is a release operation if it succeeds. So maybe we are OK, except 
> when we break from that loop and return because the thread state was 
> neither waiting nor sending_lock.
>
> My atomic_thread_fence() calls try to fill those missing gaps, I think.
>
>
>>>>>>
>>>>>> Again, this was the thinking, or at least what I remember from many 
>>>>>> years ago when I was an expert in this code ;-)
>>>>>> Maybe there is something subtly wrong with this thinking on ARM, 
>>>>>> while it was correct for x86....
>>>>>>
>>>>>> So you are right about these memory barriers in prepare_wait() 
>>> and do_wake_with(), but I may have found another very related place where we 
>>> are missing them. 
>>>
>>> In thread::do_wait_until() we do use wait_guard, which calls 
>>> prepare_wait(). However, prepare_wait() uses store(), which perhaps only 
>>> issues a *store* memory barrier instruction. And wait_until() checks the 
>>> shared variable - t in this case. For t changed by the action in 
>>> do_wake_with(), are we not missing a load barrier before checking the predicate?
>>>
>>> Relatedly, as you can see above, wake_impl() may fail to change the st 
>>> variable to waking for the reasons described above. In that case, perhaps 
>>> the compare_exchange_weak() call that returns false does NOT issue a store 
>>> barrier, does it? So maybe in this case, before we return from that while 
>>> loop in wake_impl(), we should also issue a memory barrier instruction?
>>>
>>> I conducted experiments where I added a generic memory barrier 
>>> instruction - dmb ish - to both prepare_wait() and wake_impl() in the 
>>> places described, and I could see my hello world execute 1-2K times before 
>>> hanging in some other place. So not as good as using atomic for t, but 
>>> these extra memory barriers definitely help.
>>>
>>> @@ -1143,6 +1164,7 @@ void thread::prepare_wait()
>>>      preempt_disable();
>>>      assert(_detached_state->st.load() == status::running);
>>>      _detached_state->st.store(status::waiting);
>>> +    asm volatile ("dmb ish");
>>>  }
>>>
>> One thing I missed is that prepare_wait() calls both load() and then 
>> store() on _detached_state->st, which is interesting, as it means that all 
>> memory operations - both loads and stores - should be ordered BEFORE and 
>> AFTER it. This would indicate that the "asm volatile ("dmb ish");" (a 
>> general memory barrier instruction that waits for both store and load 
>> instructions to complete) is *NOT* necessary. But when I remove it, I 
>> get back the same hang failures.
>>
>> Maybe it is because, when I use "asm volatile ("dmb ishst");" in 
>> wake_impl(), which is NOT specific to _detached_state->st, the rules of 
>> memory barriers do not apply? Meaning the load() has to match a store() of 
>> the *SAME* atomic variable, no?
>>
>> For example, this statement from the C++ memory ordering documentation 
>> seems to indicate so:
>>
>> "memory_order_release - A store operation with this memory order performs 
>> the release operation: no reads or writes in the current thread can be 
>> reordered after this store. All writes in the current thread are visible in 
>> other threads that *acquire the same atomic variable* (see 
>> Release-Acquire ordering below), and writes that carry a dependency into the 
>> atomic variable become visible in other threads *that consume the same 
>> atomic* (see Release-Consume ordering below)."
>>
>
> The more I dig into atomics and acquire-release semantics, the more I 
> learn that indeed load()/store() (and their derivatives) need to be called 
> against the same atomic variable.
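> As a minimal illustration of that rule (hypothetical names, not OSv code): 
> the acquire load pairs with the release store on the same atomic object, 
> and that pairing is what publishes the plain write:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
int shared_plain = 0;        // not atomic; published via 'ready'

void producer() {
    shared_plain = 7;                                 // plain write
    ready.store(true, std::memory_order_release);     // release on 'ready'
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {} // acquire on 'ready'
    return shared_plain;                              // guaranteed to see 7
}
```

> Had the producer release-stored to one atomic and the consumer 
> acquire-loaded a different one, no synchronizes-with edge would form and 
> the read of shared_plain would be a data race.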
>
>>
>> So maybe, because we use a non-atomic barrier instruction in wake_impl(), we 
>> have to use a similar one in prepare_wait() to make sure that changes to 
>> the shared variable *t* are visible between the 2 threads.
>>
>>> @@ -1192,6 +1214,13 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>>>      trace_sched_wake(st->t);
>>>      while (!st->st.compare_exchange_weak(old_status, status::waking)) {
>>>          if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
>>> +            asm volatile ("dmb ishst");
>>>              return;
>>>          }
>>>
>> The C++ equivalent to the asm instructions would be this:
>
> @@ -1143,6 +1164,8 @@ void thread::prepare_wait()
>      preempt_disable();
>      assert(_detached_state->st.load() == status::running);
>      _detached_state->st.store(status::waiting);
> +    std::atomic_thread_fence(std::memory_order_acquire);
>  }
>  
>  // This function is responsible for changing a thread's state from
> @@ -1192,6 +1215,15 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>      trace_sched_wake(st->t);
>      while (!st->st.compare_exchange_weak(old_status, status::waking)) {
>          if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
> +            std::atomic_thread_fence(std::memory_order_release);
>              return;
>          }
>      }
> @@ -1209,6 +1241,7 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/b7dbda8c-a6f6-4fc3-8c76-b94bffb8dba1n%40googlegroups.com.
