The more I read about the C++11 memory model and atomics, the more I
suspect that we may be missing the "synchronizes-with" relationships, in
the right places, needed for changes to _t in wait_record (and maybe other
places) to be visible between threads on different CPUs.
More specifically, wake_impl(), called from thread::do_wake_with() AFTER
calling action (_t = nullptr), may break from the loop and therefore fail
to store a new value of *_detached_state->st*. In such a scenario, there is
potentially no "release" operation that would tie to a corresponding
"acquire" load of *_detached_state->st* on the other CPU.
I think the only way to provide such a "synchronizes-with" relationship for
_t is to turn it into an atomic and use release/acquire memory order for
the corresponding load()/store() calls in wait_record.hh.
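As a concrete illustration of that proposal, here is a minimal sketch of what an atomic _t could look like. The class shape is heavily simplified (thread_stub and the method bodies are invented for this example; the real wait_record in wait_record.hh has more members and ties into the scheduler):

```cpp
#include <atomic>
#include <cassert>

// Simplified stand-in for sched::thread, for illustration only.
struct thread_stub { int id; };

// Hypothetical sketch of wait_record with _t made atomic.
// The release store in wake() pairs with the acquire load in
// woken(), so clearing _t on one CPU is guaranteed to become
// visible to a waiter polling on another CPU.
class wait_record {
    std::atomic<thread_stub*> _t;
public:
    explicit wait_record(thread_stub* t) : _t(t) {}
    void wake() {
        // release: every write before this store is visible to
        // whoever acquire-loads _t and observes nullptr
        _t.store(nullptr, std::memory_order_release);
        // ... then actually wake the thread (omitted here)
    }
    bool woken() const {
        // acquire: pairs with the release store in wake()
        return _t.load(std::memory_order_acquire) == nullptr;
    }
};
```

With this shape, the predicate check in do_wait_until() would go through woken() and get the acquire ordering for free.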
On Saturday, April 10, 2021 at 11:44:28 PM UTC-4 Waldek Kozaczuk wrote:
> On Friday, April 9, 2021 at 10:37:55 AM UTC-4 Waldek Kozaczuk wrote:
>
>> On Thu, Apr 8, 2021 at 11:15 AM Waldek Kozaczuk <[email protected]>
>> wrote:
>>
>>> On Wed, Apr 7, 2021 at 10:56 AM Waldek Kozaczuk <[email protected]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Apr 7, 2021 at 01:36 Waldek Kozaczuk <[email protected]>
>>>> wrote:
>>>>
>>>>> I think you are right about wait_record and memory barriers (unless
>>>>> there is still some hole in our thinking :-)).
>>>>>
>>>>> So let us get back to one of the observations I made and the related
>>>>> question I posed here -
>>>>> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-803710337
>>>>> .
>>>>>
>>>>> The thread::wake_impl() method has a deliberate "if" logic to validate
>>>>> that the current (old) status of the thread is one of the initial states
>>>>> per the specified *allowed_initial_states_mask* argument. If it is
>>>>> NOT, the wake_impl() simply returns without actually waking a thread
>>>>> (setting need_reschedule to true or calling send_wakeup_ipi()). It is
>>>>> interesting that neither wake_impl() nor wake() returns a result of the
>>>>> "waking process" - the return type is void. I wonder what the rationale
>>>>> behind it was. Maybe in most cases, it does not matter whether we really
>>>>> wake it or not. But maybe in some, it does, and we should be able to know
>>>>> that the thread was not really woken.
>>>>>
>>>>> Normally, the thread that wake_impl() is called on (per the st argument)
>>>>> would most likely be in the *waiting* state. But clearly, that does not
>>>>> always have to be the case - the same thread could, for example, have
>>>>> just been woken by another thread on another CPU.
>>>>>
>>>>> I have added these debug printouts to see what states the thread might
>>>>> be in when wake_impl() is called:
>>>>>
>>>>> void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>>>>> {
>>>>>     status old_status = status::waiting;
>>>>>     trace_sched_wake(st->t);
>>>>>     while (!st->st.compare_exchange_weak(old_status, status::waking)) {
>>>>>         if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
>>>>>             if (allowed_initial_states_mask == ((1 << unsigned(status::waiting)) | (1 << unsigned(status::sending_lock)))) {
>>>>>                 if (old_status != status::waiting && old_status != status::sending_lock) {
>>>>>                     debug_early_u64("! wake_impl: ", (unsigned)old_status);
>>>>>                 }
>>>>>             }
>>>>>             return;
>>>>>         }
>>>>>     }
>>>>>
>>>>> Please note I am specifically logging only cases when wake_impl() is
>>>>> called from thread::wake_with_from_mutex(Action action). With one CPU, I
>>>>> occasionally see that "queued" is logged. With 2 CPUs, besides "queued",
>>>>> I also occasionally see "waking" and "running", which is interesting.
>>>>> Most of the time, even when I see these "waking" and "running" printouts,
>>>>> OSv would NOT hang. But sometimes it would, right after.
>>>>>
>>>> I think that “queued” is not an issue, as this thread is in the runqueue
>>>> and will eventually be running. With a single CPU, the "running" case is
>>>> impossible (only one thread can be running at a time), and "waking" is
>>>> also OK.
>>>>
>>>> I wonder if step 2 below is ever possible or should be?
>>>>
>>>> Or maybe we have some memory ordering issue with the waitlist, where
>>>> popping a waiter is not visible on the other cpu and it is still in that
>>>> list, or maybe we have duplicates of the same thread.
>>>>
>>>>>
>>>>> Now if we focus on lockless mutex (core/lfmutex.cc) and two key
>>>>> methods - lock() and unlock() - I think we can imagine following scenario:
>>>>>
>>>>> 1. Thread T1 on CPU0 calls lock() on some lock-free mutex M and ends
>>>>> up creating a wait_record "waiter" on the stack (it could NOT acquire
>>>>> the lock), adds "waiter" to M's waitqueue, and finally calls
>>>>> waiter.wait(), which ends up calling sched::do_wait_until(). Eventually,
>>>>> T1 gets scheduled out and becomes "waiting".
>>>>> 2. Later, thread T2 on CPU0 calls thread::wake() or some other code path
>>>>> not related to the lock-free mutex that calls wake_impl() to wake thread
>>>>> T1 (I am not sure what such a scenario might be).
>>>>> 3a. Thread T1 becomes running again on CPU0, but it checks the
>>>>> condition of the waiter, which obviously is NOT true (!t), and it goes
>>>>> back to "waiting".
>>>>> 3b. Almost at the same time as T1 is running per 3a on CPU0, thread T3
>>>>> on CPU1 calls unlock() on the same mutex M, pops the waiter from 1) as
>>>>> part of the "Responsibility Hand-Off" protocol, and calls wake()
>>>>> (assuming it will simply wake the thread). But in the end, the
>>>>> wake_impl() "if" determines that T1 is running, so it actually never
>>>>> wakes it, and T1 stays stuck like this (no?) unless some other thread
>>>>> later simply wakes it again (t was set to null in the waiter). But what
>>>>> if such a thread never comes to wake it?
>>>>>
>>>>> Am I misunderstanding something or describing a scenario that should
>>>>> never happen? Or maybe it is a symptom of another problem?
>>>>>
>>>>> If my thinking is correct, then:
>>>>> 1. Shouldn't we check whether wake_impl() actually worked, all the way
>>>>> up in the lock-free mutex code, and push the waiter back onto its
>>>>> waitqueue if it did not? Or is it even more complicated, and should we
>>>>> iterate to find a waiter it can actually wake?
>>>>> 2. Shouldn't the implementation of thread::do_wake_with(Action action,
>>>>> unsigned allowed_initial_states_mask) change to call action ONLY if
>>>>> wake_impl() actually woke the thread (meaning the while loop succeeded)?
>>>>> We would need another wake_impl() that takes action as an argument.
>>>>>
>>>>> If that is indeed a real problem, why does it not happen on Intel?
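To illustrate suggestion 2 above, here is a hedged sketch of a wake_impl() variant that reports success and a do_wake_with() that runs the action only on success. The names wake_impl_returning, the simplified status enum, and the overall shape are invented for this example; this is not the real OSv scheduler code:

```cpp
#include <atomic>
#include <cassert>

// Simplified stand-in for OSv's thread status enum.
enum class status { running, waiting, waking, queued, sending_lock };

// Hypothetical wake_impl() variant that reports whether it actually
// transitioned the thread into "waking".
bool wake_impl_returning(std::atomic<status>& st,
                         unsigned allowed_initial_states_mask) {
    status old_status = status::waiting;
    while (!st.compare_exchange_weak(old_status, status::waking)) {
        if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
            return false;  // disallowed state: the wake is dropped
        }
    }
    return true;
}

// Sketch of a do_wake_with() that only runs the action when the wake
// actually happened; on failure the caller could, for example, push
// the wait_record back onto the mutex's waitqueue.
template <typename Action>
bool do_wake_with(std::atomic<status>& st, Action action,
                  unsigned allowed_initial_states_mask) {
    if (!wake_impl_returning(st, allowed_initial_states_mask)) {
        return false;
    }
    action();  // e.g. set wait_record::_t to nullptr
    return true;
}
```

Whether running the action after the state transition (instead of before, as today) is safe for the RHO protocol is exactly the kind of question a patch review would have to settle.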
>>>>>
>>>>
>>> I have applied the same patch and ran the same tests, and I could see
>>> the same occasional "waking" and "running" printouts. But on x64 it never
>>> hangs. So maybe the scenarios where wake_impl() fails to change the state
>>> to waking are legitimate cases that are correctly handled, but I still do
>>> not understand how.
>>>
>>> So back to the drawing board. I still think we are missing some memory
>>> barrier instructions somewhere. See below.
>>>
>>>
>>>>
>>>>> On Sunday, April 4, 2021 at 12:32:04 PM UTC-4 Nadav Har'El wrote:
>>>>>
>>>>>> On Sun, Apr 4, 2021 at 6:00 PM Waldek Kozaczuk <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, April 1, 2021 at 12:36:19 PM UTC-4 Nadav Har'El wrote:
>>>>>>>
>>>>>>>> On Fri, Mar 26, 2021 at 6:10 AM Waldek Kozaczuk <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> As I have been researching a bit the SMP issue described here -
>>>>>>>>> https://github.com/cloudius-systems/osv/issues/1123 - I have
>>>>>>>>> noticed that the 't' variable in
>>>>>>>>> https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33
>>>>>>>>>
>>>>>>>>> might be related to why OSv hangs with two or more CPUs.
>>>>>>>>>
>>>>>>>>> As I am describing here -
>>>>>>>>> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270
>>>>>>>>>
>>>>>>>>> - writing to the thread pointer variable by thread T1 on cpu0, may
>>>>>>>>> not be
>>>>>>>>> visible to thread T2 on cpu1 in the weak-memory model scenario.
>>>>>>>>>
>>>>>>>>
>>>>>>>> This is an interesting question. Indeed, waiter::t is not
>>>>>>>> marked atomic, and it's been a very long time since I wrote this code.
>>>>>>>> Note that "t" is not public, and the only methods that access t are
>>>>>>>> wake() and wait().
>>>>>>>>
>>>>>>>> I *think* (if I remember correctly) my thinking was that each of
>>>>>>>> these wake() and wait() operations already has its *own* memory
>>>>>>>> barriers: namely, wake_impl() and prepare_wait() both access the state
>>>>>>>> of the thread with seq_cst ordering -
>>>>>>>> https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering).
>>>>>>>>
>>>>>>>> I think (again, I hope I remember correctly) we were hoping that
>>>>>>>> these barriers also apply to the t accesses before or after them.
>>>>>>>>
>>>>>>>> The thinking was that it is similar to how one uses a shared
>>>>>>>> variable protected by a mutex: you don't need the shared variable to
>>>>>>>> be std::atomic, or to care about memory order or barriers - the mutex
>>>>>>>> already takes care of that for you.
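For reference, the analogy is ordinary code like the following, where a plain int is safe to share precisely because the mutex's lock/unlock provide the acquire/release ordering (a generic illustration, not OSv code):

```cpp
#include <cassert>
#include <mutex>
#include <thread>

std::mutex m;
int shared_counter = 0;  // plain int, deliberately NOT std::atomic

void worker() {
    for (int i = 0; i < 1000; i++) {
        std::lock_guard<std::mutex> guard(m);
        // lock() acts as an acquire and unlock() as a release, so
        // this plain increment is visible across CPUs without any
        // explicit atomics or barriers in user code
        ++shared_counter;
    }
}
```

Two threads running worker() concurrently always leave shared_counter at 2000.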
>>>>>>>>
>>>>>>>
>>>>>>> I think that is true on Intel but not necessarily on ARM.
>>>>>>> Specifically, in the SMP case, the changes made to that shared
>>>>>>> variable on CPU 1, even within a mutex critical section, are not
>>>>>>> visible to CPU 2, because the changes made by the former will still be
>>>>>>> in the L1 cache. Only when we use proper memory barrier instructions
>>>>>>> (I have figured out which ones to use and where exactly), OR
>>>>>>> std::atomic<> (which on ARM translates to the *LDAR/STLR instructions*
>>>>>>> that by nature make the changes to memory atomic and visible to all
>>>>>>> CPUs), do the changes to _t become visible.
>>>>>>>
>>>>>>
>>>>>> I don't think this is the case. Multi-threaded code which uses
>>>>>> traditional Posix mutex does *not* use any std::atomic or barriers.
>>>>>> People
>>>>>> wrote such code before C++11's std::atomic, and without barriers in
>>>>>> assembly language (again, I'm talking about Posix threads code with
>>>>>> mutexes
>>>>>> - not about lockfree code).
>>>>>> The assumption is that the *implementation* of the mutex is using
>>>>>> these barriers: "acquire" and "release" barriers so that for example
>>>>>> changes that one thread does to a shared variable inside a critical
>>>>>> section
>>>>>> protected by a mutex will be visible by some other thread on another CPU
>>>>>> which later takes the same mutex.
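A toy spinlock makes this point concrete: the acquire lives in lock(), the release in unlock(), and the data protected by it needs no atomics of its own (this is a generic sketch, not OSv's lfmutex):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Minimal spinlock sketch: the memory ordering lives inside the
// lock implementation, not in the code that uses it.
class spinlock {
    std::atomic<bool> _locked{false};
public:
    void lock() {
        // acquire: operations in the critical section cannot be
        // reordered before this exchange
        while (_locked.exchange(true, std::memory_order_acquire)) {
            // spin until the lock is free
        }
    }
    void unlock() {
        // release: writes done inside the critical section become
        // visible to the next thread that acquires the lock
        _locked.store(false, std::memory_order_release);
    }
};
```

Code that increments a plain int under this spinlock from several threads stays correct for the same reason the Posix mutex example does.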
>>>>>>
>>>>>> The wait_record.hh wait()/wake() do NOT use a mutex (they are used in
>>>>>> the mutex implementation ;-)) but they do have barriers as I'll explain
>>>>>> below:
>>>>>>
>>>>>>
>>>>>>> Otherwise, the wait() method in wait_record.hh will not see _t equal
>>>>>>> to null when the thread waiting on CPU2 is woken. So the effect is
>>>>>>> that the interrupt arrives and the thread is woken, but it does not
>>>>>>> see the _t value equal to 0. But I might be wrong in my reasoning.
>>>>>>>
>>>>>>>
>>>>>>>> But you're right that it's hard to catch bugs in this memory
>>>>>>>> ordering on x86 and it's easier to do it wrong on ARM.
>>>>>>>>
>>>>>>>> The question is what exactly is wrong:
>>>>>>>>
>>>>>>>> 1. Is the ordering in wake_impl() and prepare_wait() (called by
>>>>>>>> waiter::wake() and waiter::wait()) not strong enough on ARM to
>>>>>>>> serialize
>>>>>>>> *everything*, not just a specific variable?
>>>>>>>>
>>>>>>>> I think this is specific to _t.
>>>>>>>
>>>>>>>>
>>>>>>>>    2. Do we have this serialization in the right place? E.g.,
>>>>>>>>    maybe we have the ordering point before writing t but we need it
>>>>>>>>    after writing t, and so on.
>>>>>>>>    3. Maybe the problem isn't in waiter::wake() or waiter::wait()
>>>>>>>>    but in other functions like waiter::woken() or waiter::clear()?
>>>>>>>>
>>>>>>>> I suggest that you change waiter.hh to use std::atomic for t, and
>>>>>>>> then use "relaxed" ordering to read/write t in some of the waiter
>>>>>>>> methods and "cst" ordering in the others - and see which method needs
>>>>>>>> the non-relaxed ordering. When we know which specific method is
>>>>>>>> missing the ordering, we can try to understand why it is missing.
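That experiment could look something like this sketch, where every accessor takes an explicit, tweakable memory order that can be flipped between relaxed and seq_cst per method (the class shape, thread_stub, and the method names read()/clear() are invented for illustration; the real waiter has a different interface):

```cpp
#include <atomic>
#include <cassert>

// Stand-in for sched::thread, for illustration only.
struct thread_stub { int id; };

// Experimental waiter sketch: t is atomic and each accessor takes an
// explicit memory order, so individual methods can be downgraded to
// relaxed one at a time to find which one needs real ordering.
class waiter {
    std::atomic<thread_stub*> t;
public:
    explicit waiter(thread_stub* w) : t(w) {}
    void clear(std::memory_order mo = std::memory_order_seq_cst) {
        t.store(nullptr, mo);
    }
    thread_stub* read(std::memory_order mo = std::memory_order_seq_cst) const {
        return t.load(mo);
    }
    bool woken(std::memory_order mo = std::memory_order_seq_cst) const {
        return read(mo) == nullptr;
    }
};
```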
>>>>>>>>
>>>>>>> Which specific place in the code do you have in mind? Nothing in
>>>>>>> wait_record.hh uses methods that involve memory ordering.
>>>>>>>
>>>>>>
>>>>>> *wait_record::wait()* calls sched::thread::wait_until(). This calls
>>>>>> do_wait_until() which starts creating a "wait_guard". This calls
>>>>>> thread::prepare_wait(), which calls
>>>>>> * _detached_state->st.store(status::waiting);*
>>>>>> because there is no memory-ordering parameter, it uses the default
>>>>>> memory_order_seq_cst
>>>>>> ordering.
>>>>>> The thinking was that this does all the necessary memory
>>>>>> barriers.
>>>>>>
>>>>>> On the other side, *wait_record::wake()* calls
>>>>>> thread::wake_with_from_mutex(), which calls do_wake_with(), which does
>>>>>> *auto ds = _detached_state.get();*
>>>>>> this get() uses memory_order_seq_cst, and so it is the other side of
>>>>>> the store barrier we had in wait().
>>>>>>
> Regarding your explanation about thread::prepare_wait() and
> do_wake_with(): the more I read about using acquire/release to make changes
> to some shared variable synced and visible between threads, shouldn't this
> be reversed? If you look at do_wake_with(), it calls action() after "auto
> ds = _detached_state.get();", which is an acquire operation, right?
>
The _detached_state is a std::unique_ptr<detached_state>, so calling get()
does not provide any load barrier. I think you may have meant another place
where load() is called on the atomic *_detached_state->st*.
> But normally the order is this:
> 1) some thread trying to acquire a lock calls *wait_record::wait()*, which
> then effectively loops in thread::do_wait_until() until it sees t null,
> right?
> 2) another thread calls *wait_record::wake()*, which then ultimately sets t
> to null and calls wake_impl()
>
> So given that, shouldn't we have the acquire operation in the
> *wait_record::wait()* chain and the release operation in do_wake_with()
> after we call the action in wake_impl()?
>
> In theory, prepare_wait() has this line:
>
> assert(_detached_state->st.load() == status::running); - the acquire
> operation
>
> followed by the store, which is the release.
>
>
> And wake_impl() has the loop calling st.compare_exchange_weak(), which
> I think is a release operation if it succeeds. So maybe we are OK, except
> when we break from that loop and return because the thread state was
> neither waiting nor sending_lock.
>
> My atomic_thread_fence() calls try to fill those missing gaps, I think.
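On the compare_exchange point: the operation takes separate memory orders for the success (read-modify-write) and failure (load-only) paths, and the failure path never writes, so it indeed cannot act as a release by itself. A small sketch of the two-order form (the names state and try_transition are invented for this example):

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> state{0};

// compare_exchange takes distinct memory orders for the success
// (read-modify-write) and failure (pure load) paths. The failure
// path performs no store, so it can never provide release ordering.
bool try_transition(int expected, int desired) {
    return state.compare_exchange_strong(
        expected, desired,
        std::memory_order_acq_rel,    // used if the exchange succeeds
        std::memory_order_acquire);   // used if it fails (load only)
}
```

compare_exchange_strong is used here to avoid the spurious failures that compare_exchange_weak permits.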
>
>
>>>>>>
>>>>>> Again, this was the thinking, or at least what I remember from many
>>>>>> years ago when I was an expert in this code ;-)
>>>>>> Maybe there is something subtly wrong with this thinking on ARM,
>>>>>> while it was correct for x86....
>>>>>>
>>> So you are right about these memory barriers in prepare_wait()
>>> and do_wake_with(), but I may have found another very related place where
>>> we are missing some.
>>>
>>> In thread::do_wait_until() we do use a wait_guard that calls
>>> prepare_wait(). However, prepare_wait() uses store(), which perhaps only
>>> issues a *store* memory barrier instruction. And wait_until() checks the
>>> shared variable - t in this case. For t changed by the action in
>>> do_wake_with(), are we not missing a load barrier before checking the
>>> predicate?
>>>
>>> Relatedly, as you can see above, wake_impl() may fail to change the st
>>> variable to waking, for the reasons described above. In that case, perhaps
>>> a compare_exchange_weak() call that returns false does NOT issue a store
>>> barrier, does it? So maybe in this case, before we return from that while
>>> loop in wake_impl(), we should also issue a memory barrier instruction?
>>>
>>> I conducted experiments where I added a generic memory barrier
>>> instruction - dmb ish - to both prepare_wait() and wake_impl() in the
>>> places described, and I could see my hello world execute 1-2K times before
>>> hanging in some other place. So not as good as using an atomic for t, but
>>> these extra memory barriers definitely help.
>>>
>>> @@ -1143,6 +1164,7 @@ void thread::prepare_wait()
>>>      preempt_disable();
>>>      assert(_detached_state->st.load() == status::running);
>>>      _detached_state->st.store(status::waiting);
>>> +    asm volatile ("dmb ish");
>>>  }
>>>
>> One thing I missed is that prepare_wait() calls both load() and then
>> store() on _detached_state->st, which is interesting, as it means that all
>> memory operations - both loads and stores - should be ordered BEFORE and
>> AFTER. This would indicate that the "asm volatile ("dmb ish");" (a general
>> memory barrier instruction that waits for both store and load instructions
>> to complete) is *NOT* necessary. But when I remove it, I get back the same
>> hangs.
>>
>> Maybe it is because when I use "asm volatile ("dmb ishst");" in
>> wake_impl(), which is NOT specific to _detached_state->st, the rules of
>> memory barriers do not apply? Meaning, the load() has to match a store()
>> of the *SAME* atomic variable, no?
>>
>> For example, this statement from C++ memory ordering documentation seems
>> to indicate:
>>
>> "memory_order_release - A store operation with this memory order performs
>> the release operation: no reads or writes in the current thread can be
>> reordered after this store. All writes in the current thread are visible in
>> other threads that *acquire the same atomic variable* (see
>> Release-Acquire ordering below) and writes that carry a dependency into the
>> atomic variable become visible in other threads* that consume the same
>> atomic* (see Release-Consume ordering below). "
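The quoted rule is the classic release/acquire pairing on one and the same atomic: plain data written before the release store is guaranteed visible after an acquire load that observes that store. A minimal, self-contained illustration:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
int payload = 0;  // plain, non-atomic data

void producer() {
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // release on 'ready'
}

void consumer() {
    // acquire on the SAME atomic variable: once we observe true,
    // the write to payload made before the release store is
    // guaranteed to be visible here
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the producer publishes
    }
    assert(payload == 42);
}
```

If the producer and consumer used different atomic variables, the release-acquire pairing the quote describes would not be established.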
>>
>
> The more I dig into atomics and acquire-release semantics, the more I
> learn that indeed load()/store() (and their derivatives) need to be called
> against the same atomic variable.
>
>>
>> So maybe, because we use a non-atomic barrier instruction in wake_impl(),
>> we have to use a similar one in prepare_wait() to make sure that changes
>> to the shared variable *t* are visible between the 2 threads.
>>
>>> @@ -1192,6 +1214,13 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>>>      trace_sched_wake(st->t);
>>>      while (!st->st.compare_exchange_weak(old_status, status::waking)) {
>>>          if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
>>> +            asm volatile ("dmb ishst");
>>>              return;
>>>          }
>> The C++ equivalent to the asm instructions would be this:
>
> @@ -1143,6 +1164,8 @@ void thread::prepare_wait()
>      preempt_disable();
>      assert(_detached_state->st.load() == status::running);
>      _detached_state->st.store(status::waiting);
> +    std::atomic_thread_fence(std::memory_order_acquire);
>  }
>
>  // This function is responsible for changing a thread's state from
> @@ -1192,6 +1215,15 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>      trace_sched_wake(st->t);
>      while (!st->st.compare_exchange_weak(old_status, status::waking)) {
>          if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
> +            std::atomic_thread_fence(std::memory_order_release);
>              return;
>          }
>      }
> @@ -1209,6 +1241,7 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>
>
>>
>>>
>>>>
--
You received this message because you are subscribed to the Google Groups "OSv
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/osv-dev/b7dbda8c-a6f6-4fc3-8c76-b94bffb8dba1n%40googlegroups.com.