On Wed, Apr 7, 2021 at 10:56 AM Waldek Kozaczuk <[email protected]> wrote:
> On Wed, Apr 7, 2021 at 01:36 Waldek Kozaczuk <[email protected]> wrote:
>
>> I think you are right about wait_record and the memory barriers (unless
>> there is still some hole in our thinking :-)).
>>
>> So let us get back to one of the observations I made and the related
>> question I posed here -
>> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-803710337.
>>
>> The thread::wake_impl() method has deliberate "if" logic to validate
>> that the current (old) status of the thread is one of the initial states
>> allowed by the *allowed_initial_states_mask* argument. If it is NOT,
>> wake_impl() simply returns without actually waking the thread (setting
>> need_reschedule to true or calling send_wakeup_ipi()). It is interesting
>> that neither wake_impl() nor wake() returns the result of the "waking
>> process" - the return type is void. I wonder what the rationale behind
>> that was. Maybe in most cases it does not matter whether we really wake
>> the thread or not. But maybe in some it does, and we should be able to
>> know that the thread was not really woken.
>>
>> Normally the thread that wake_impl() is called on (per the st argument)
>> would most likely be in the *waiting* state. But clearly it does not
>> have to be the case always - the same thread could have just been woken
>> by another thread on another CPU, for example.
>>
>> I have added these debug printouts to see what states a thread might be
>> in when wake_impl() is called:
>>
>> void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>> {
>>     status old_status = status::waiting;
>>     trace_sched_wake(st->t);
>>     while (!st->st.compare_exchange_weak(old_status, status::waking)) {
>>         if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
>>             *if (allowed_initial_states_mask == ((1 << unsigned(status::waiting)) | (1 << unsigned(status::sending_lock)))) {*
>>                 *if (old_status != status::waiting && old_status != status::sending_lock) {*
>>                     *debug_early_u64("! wake_impl: ", (unsigned)old_status);*
>>                 *}*
>>             *}*
>>             return;
>>         }
>>     }
>> }
>>
>> Please note I am specifically logging only cases when wake_impl() is
>> called from thread::wake_with_from_mutex(Action action). With one CPU,
>> I occasionally see "queued" logged. With 2 CPUs, besides "queued", I
>> also occasionally see "waking" and "running", which is interesting.
>> Most of the time, even if I see these "waking" and "running" printouts,
>> OSv would NOT hang. But sometimes it would, right after.
>
> I think that "queued" is not an issue, as this thread is in the runqueue
> and will eventually be running. With a single CPU the "running" case is
> impossible (only one thread can be running at a time) and "waking" is
> also OK.
>
> I wonder if step 2 below is ever possible, or should be.
>
> Or maybe we have some memory ordering issue with the waitlist, where
> popping a waiter is not visible on the other CPU and it is still in that
> list, or maybe we have duplicates of the same thread.
>
>> Now if we focus on the lockless mutex (core/lfmutex.cc) and its two key
>> methods - lock() and unlock() - I think we can imagine the following
>> scenario:
>>
>> 1.
>> Thread T1 on CPU0 calls lock() on some lock-free mutex M and ends up
>> creating a wait_record "waiter" on the stack (it could NOT acquire the
>> lock), adds "waiter" to M's waitqueue, and finally calls waiter.wait(),
>> which ends up calling sched::do_wait_until(). Eventually T1 gets
>> scheduled out and becomes "waiting".
>>
>> 2. Later, thread T2 on CPU0 calls thread::wake(), or some other place
>> in the code not related to the lock-free mutex calls wake_impl(), to
>> wake thread T1 (I am not sure what such a scenario might be).
>>
>> 3a. Thread T1 becomes running again on CPU0, but it checks the
>> condition of the waiter, which obviously is NOT true (!t), and it goes
>> back to "waiting".
>>
>> 3b. Almost at the same time as T1 is running per 3a on CPU0, thread T3
>> on CPU1 calls unlock() on the same mutex M, pops the waiter from 1) as
>> part of the "Responsibility Hand-Off" protocol, and calls wake()
>> (assuming it will simply wake the thread). But in the end, the
>> wake_impl() "if" determines that T1 is running, so it actually never
>> wakes it, and T1 stays stuck like this (no?) unless some other thread
>> later simply wakes it again (t was set to null in the waiter). But what
>> if such a thread never comes to wake it?
>>
>> Am I misunderstanding something, or describing a scenario that should
>> never happen? Or maybe it is a symptom of another problem?
>>
>> If my thinking is correct, then:
>>
>> 1. Shouldn't we check in the lock-free mutex code whether wake_impl()
>> actually worked all the way, and push the waiter back onto its
>> waitqueue if it did not? Or is it even more complicated, and should we
>> iterate to find a waiter it can actually wake?
>>
>> 2. Shouldn't the implementation of thread::do_wake_with(Action action,
>> unsigned allowed_initial_states_mask) change to call action ONLY if
>> wake_impl() actually woke the thread (meaning the while loop
>> succeeded)? We would need another wake_impl() that would take action as
>> an argument.
>>
>> If that is indeed a real problem, why does it not happen on Intel?

I have applied the same patch and ran the same tests, and I could see the
same occasional "waking" and "running" printouts. But on x64 it never
hangs. So maybe the scenarios that involve wake_impl() failing to change
the state to waking are legitimate cases that are handled correctly, but
I still do not understand how. So, back to the board. I still think we
are missing some memory barrier instructions somewhere. See below.

>> On Sunday, April 4, 2021 at 12:32:04 PM UTC-4 Nadav Har'El wrote:
>>
>>> On Sun, Apr 4, 2021 at 6:00 PM Waldek Kozaczuk <[email protected]>
>>> wrote:
>>>
>>>> On Thursday, April 1, 2021 at 12:36:19 PM UTC-4 Nadav Har'El wrote:
>>>>
>>>>> On Fri, Mar 26, 2021 at 6:10 AM Waldek Kozaczuk
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> As I have been researching a bit the SMP issue described here -
>>>>>> https://github.com/cloudius-systems/osv/issues/1123 - I have
>>>>>> noticed that the 't' variable in
>>>>>> https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33
>>>>>> might be related to why OSv hangs with two or more CPUs.
>>>>>>
>>>>>> As I am describing here -
>>>>>> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270
>>>>>> - writing to the thread pointer variable by thread T1 on cpu0 may
>>>>>> not be visible to thread T2 on cpu1 in a weak-memory-model
>>>>>> scenario.
>>>>>
>>>>> This is an interesting question. Indeed, waiter::t is not marked
>>>>> atomic, and it's been a very long time since I wrote this code. Note
>>>>> that "t" is not public, and the only methods that access t are
>>>>> wake() and wait().
>>>>>
>>>>> I *think* (if I remember correctly) my thinking was that each of
>>>>> these wake() and wait() operations already has its *own* memory
>>>>> barriers.
>>>>> Namely, wake_impl() and prepare_wait() both access the state of the
>>>>> thread with CST ordering (
>>>>> https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering
>>>>> ). I think (again, I hope I remember correctly) we were hoping that
>>>>> these barriers also apply to the t accesses before or after them.
>>>>>
>>>>> The thinking was that it is similar to how one uses a shared
>>>>> variable protected by a mutex: you don't need the shared variable to
>>>>> be std::atomic or to care about memory order or barriers - the mutex
>>>>> already takes care of that for you.
>>>>
>>>> I think that is true on Intel but not necessarily on ARM.
>>>> Specifically, in the SMP case, the changes made to that shared
>>>> variable on CPU 1, even within a mutex critical section, are not
>>>> visible to CPU 2, because the changes made by the former will still
>>>> be in the L1 cache. Only when we use proper memory barrier
>>>> instructions (I have figured out which ones to use and where exactly)
>>>> OR std::atomic<> (which on ARM translates to the *LDAR/STLR
>>>> instructions* that by nature make the changes to memory atomic and
>>>> visible to all CPUs) are the changes to _t visible.
>>>
>>> I don't think this is the case. Multi-threaded code which uses a
>>> traditional POSIX mutex does *not* use any std::atomic or barriers.
>>> People wrote such code before C++11's std::atomic, and without
>>> barriers in assembly language (again, I'm talking about POSIX threads
>>> code with mutexes - not about lock-free code).
>>> The assumption is that the *implementation* of the mutex uses these
>>> barriers: "acquire" and "release" barriers, so that, for example,
>>> changes that one thread makes to a shared variable inside a critical
>>> section protected by a mutex will be visible to another thread on
>>> another CPU which later takes the same mutex.
>>>
>>> The wait_record.hh wait()/wake() do NOT use a mutex (they are used in
>>> the mutex implementation ;-)), but they do have barriers, as I'll
>>> explain below:
>>>
>>>> Otherwise the wait() method in wait_record.hh will not see _t equal
>>>> to null when the thread waiting on CPU2 is woken. So the effect is
>>>> that the interrupt arrives and the thread is woken, but it does not
>>>> see the _t value equal to 0. But I might be wrong in my reasoning.
>>>>
>>>>> But you're right that it's hard to catch bugs in this memory
>>>>> ordering on x86, and it's easier to get it wrong on ARM.
>>>>>
>>>>> The question is what exactly is wrong:
>>>>>
>>>>> 1. Is the ordering in wake_impl() and prepare_wait() (called by
>>>>> waiter::wake() and waiter::wait()) not strong enough on ARM to
>>>>> serialize *everything*, not just a specific variable?
>>>>
>>>> I think this is specific to _t.
>>>>
>>>>> 2. Do we have this serialization in the right place? E.g., maybe we
>>>>> have the ordering point before writing t but we need it after
>>>>> writing t, and so on.
>>>>>
>>>>> 3. Maybe the problem isn't in waiter::wake() or waiter::wait() but
>>>>> in other functions like waiter::woken() or waiter::clear()?
>>>>>
>>>>> I suggest that you change wait_record.hh to use std::atomic for t,
>>>>> and then use "relaxed" ordering to read/write t in some of the
>>>>> waiter methods and "cst" ordering in the others - and see which
>>>>> method needs the non-relaxed ordering. When we know which specific
>>>>> method is missing the ordering, we can try to understand why it is
>>>>> missing.
>>>>
>>>> Which specific place in the code do you have in mind? Nothing in
>>>> wait_record.hh uses memory-ordering-related methods.
>>>
>>> *wait_record::wait()* calls sched::thread::wait_until(). This calls
>>> do_wait_until(), which starts by creating a "wait_guard".
>>> This calls thread::prepare_wait(), which calls
>>> *_detached_state->st.store(status::waiting);*
>>> Because there is no memory-ordering parameter, it uses the default
>>> memory_order_seq_cst ordering. The thinking was that this does all the
>>> necessary memory barriers.
>>>
>>> On the other side, *wait_record::wake()* calls
>>> thread::wake_with_from_mutex(), which calls do_wake_with(), which does
>>> *auto ds = _detached_state.get();*
>>> This get() uses memory_order_seq_cst, and so it is the other side of
>>> the store barrier we had in wait().
>>>
>>> Again, this was the thinking, or at least what I remember from many
>>> years ago when I was an expert in this code ;-)
>>> Maybe there is something subtly wrong with this thinking on ARM, while
>>> it was correct for x86....

So you are right about these memory barriers in prepare_wait() and
do_wake_with(), but I may have found another very related place where we
are missing some. In thread::do_wait_until() we do use a wait_guard that
calls prepare_wait(). However, prepare_wait() uses store(), which perhaps
only issues a *store* memory barrier instruction, and wait_until() then
checks the shared variable - t, in this case. For t changed by the action
in do_wake_with(), are we not missing a load barrier before checking the
predicate?

Relatedly, as you can see above, wake_impl() may fail to change the st
variable to waking, for the reasons described above. In that case, the
compare_exchange_weak() call that returns false does NOT issue a store
barrier, does it? So maybe in this case, before we return from that while
loop in wake_impl(), we should also issue a memory barrier instruction?

I conducted experiments where I added a generic memory barrier
instruction - dmb ishst - to both prepare_wait() and wake_impl() in the
places described, and I could see my hello world execute 1-2K times
before hanging in some other place. So not as good as using atomic for t,
but these extra memory barriers definitely help.
@@ -1143,6 +1164,7 @@ void thread::prepare_wait()
     preempt_disable();
     assert(_detached_state->st.load() == status::running);
     _detached_state->st.store(status::waiting);
+    asm volatile ("dmb ishst");
 }

@@ -1192,6 +1214,13 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
     trace_sched_wake(st->t);
     while (!st->st.compare_exchange_weak(old_status, status::waking)) {
         if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
+            asm volatile ("dmb ishst");
             return;
         }

--
You received this message because you are subscribed to the Google Groups
"OSv Development" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/osv-dev/CAL9cFfMPcXuSPOSZ51EeWBmeQtUHbk2PU7MQk%3Dgwmc5i1mzJpA%40mail.gmail.com.
