On Friday, April 9, 2021 at 10:37:55 AM UTC-4 Waldek Kozaczuk wrote:
> On Thu, Apr 8, 2021 at 11:15 AM Waldek Kozaczuk <[email protected]> wrote:
>
>> On Wed, Apr 7, 2021 at 10:56 AM Waldek Kozaczuk <[email protected]> wrote:
>>
>>> On Wed, Apr 7, 2021 at 01:36 Waldek Kozaczuk <[email protected]> wrote:
>>>
>>>> I think you are right about wait_record and memory barriers (unless
>>>> there is still some hole in our thinking :-)).
>>>>
>>>> So let us get back to one of the observations I made and the related
>>>> question I posed here:
>>>> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-803710337
>>>>
>>>> The thread::wake_impl() method has deliberate "if" logic to validate
>>>> that the current (old) status of the thread is one of the initial
>>>> states allowed by the *allowed_initial_states_mask* argument. If it is
>>>> NOT, wake_impl() simply returns without actually waking the thread
>>>> (i.e. without setting need_reschedule to true or calling
>>>> send_wakeup_ipi()). Interestingly, neither wake_impl() nor wake()
>>>> returns the result of the "waking process" - the return type is void.
>>>> I wonder what the rationale behind that was. Maybe in most cases it
>>>> does not matter whether we really wake the thread. But maybe in some
>>>> cases it does, and we should be able to know that the thread was not
>>>> actually woken.
>>>>
>>>> Normally the thread that wake_impl() is called on (per the st
>>>> argument) would most likely be in the "waiting" state. But clearly
>>>> that does not always have to be the case - the same thread could have
>>>> just been woken by another thread on another CPU, for example.
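[Editor's note: a minimal, hypothetical sketch of the idea raised above - a wake_impl()-style helper that *returns* whether the CAS loop actually moved the thread to "waking". This is not OSv's actual code; the enum values and the function name try_wake are made up for illustration, but the mask check mirrors the "if" logic being discussed.]

```cpp
#include <atomic>
#include <cassert>

// Hypothetical status enum mirroring the states mentioned in the thread.
enum class status : unsigned { running, waiting, waking, queued, sending_lock };

// Sketch of a bool-returning wake: true means we really transitioned the
// thread to "waking"; false means its state was not in the allowed mask,
// so the wake was silently skipped - exactly the case the caller currently
// cannot detect because wake_impl() returns void.
static bool try_wake(std::atomic<status>& st, unsigned allowed_initial_states_mask) {
    status old_status = status::waiting;
    while (!st.compare_exchange_weak(old_status, status::waking)) {
        if (!((1u << unsigned(old_status)) & allowed_initial_states_mask)) {
            return false;   // disallowed state: wake skipped
        }
    }
    return true;            // CAS succeeded: thread is now "waking"
}
```

With such a return value, the lock-free mutex code could react (e.g. re-push the waiter) when a wake is skipped.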
>>>> I have added these debug printouts to see what states a thread might
>>>> be in when wake_impl() is called:
>>>>
>>>> void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>>>> {
>>>>     status old_status = status::waiting;
>>>>     trace_sched_wake(st->t);
>>>>     while (!st->st.compare_exchange_weak(old_status, status::waking)) {
>>>>         if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
>>>>             if (allowed_initial_states_mask == ((1 << unsigned(status::waiting)) | (1 << unsigned(status::sending_lock)))) {
>>>>                 if (old_status != status::waiting && old_status != status::sending_lock) {
>>>>                     debug_early_u64("! wake_impl: ", (unsigned)old_status);
>>>>                 }
>>>>             }
>>>>             return;
>>>>         }
>>>>     }
>>>>
>>>> Please note I am specifically logging only cases when wake_impl() is
>>>> called from thread::wake_with_from_mutex(Action action). With one CPU,
>>>> I occasionally see "queued" logged. With 2 CPUs, besides "queued", I
>>>> also occasionally see "waking" and "running", which is interesting.
>>>> Most of the time, even when I see these "waking" and "running"
>>>> printouts, OSv does NOT hang. But sometimes it hangs right after.
>>>
>>> I think that "queued" is not an issue, as this thread is in the
>>> runqueue and will eventually be running. With a single CPU the
>>> "running" case is impossible (only one thread can be running), and
>>> "waking" is also OK.
>>>
>>> I wonder if step 2 below is ever possible, or should be.
>>>
>>> Or maybe we have some memory-ordering issue with the wait list, where
>>> popping a waiter is not visible on the other CPU and it is still in
>>> that list, or maybe we have duplicates of the same thread.
>>>
>>>> Now if we focus on the lockless mutex (core/lfmutex.cc) and its two
>>>> key methods - lock() and unlock() - I think we can imagine the
>>>> following scenario:
>>>>
>>>> 1. Thread T1 on CPU0 calls lock() on some lock-free mutex M and ends
>>>> up creating a wait_record "waiter" on the stack (it could NOT acquire
>>>> the lock), adds "waiter" to M's waitqueue, and finally calls
>>>> waiter.wait(), which ends up calling sched::do_wait_until().
>>>> Eventually T1 gets scheduled out and becomes "waiting".
>>>> 2. Later, thread T2 on CPU0 calls thread::wake() - or some other place
>>>> in the code not related to the lock-free mutex calls wake_impl() - to
>>>> wake thread T1 (I am not sure what such a scenario might be).
>>>> 3a. Thread T1 becomes running again on CPU0, but it checks the
>>>> condition of the waiter, which obviously is NOT true (!t), and it goes
>>>> back to "waiting".
>>>> 3b. Almost at the same time as T1 is running per 3a on CPU0, thread T3
>>>> on CPU1 calls unlock() on the same mutex M, pops the waiter from
>>>> step 1 as part of the "Responsibility Hand-Off" protocol, and calls
>>>> wake() (assuming it will simply wake the thread). But in the end, the
>>>> wake_impl() "if" determines that T1 is running, so it never actually
>>>> wakes it, and T1 stays stuck like this (no?) unless some other thread
>>>> later simply wakes it again (t was set to null in the waiter). But
>>>> what if such a thread never comes to wake it?
>>>>
>>>> Am I misunderstanding something, or describing a scenario that should
>>>> never happen? Or maybe it is a symptom of another problem?
>>>>
>>>> If my thinking is correct, then:
>>>> 1. Shouldn't we check whether wake_impl() actually worked all the way
>>>> in the lock-free mutex code, and push the waiter back onto its
>>>> waitqueue if it did not? Or is it even more complicated, and we should
>>>> iterate to find a waiter it can actually wake?
>>>> 2. Shouldn't the implementation of thread::do_wake_with(Action action,
>>>> unsigned allowed_initial_states_mask) change to call action ONLY if
>>>> wake_impl() actually woke the thread (meaning the while loop
>>>> succeeded)?
>>>> We would need another wake_impl() that would take action as an
>>>> argument.
>>>>
>>>> If that is indeed a real problem, why does it not happen on Intel?
>>>
>> I have applied the same patch and ran the same tests, and I could see
>> the same occasional "waking" and "running" printouts. But on x64 it
>> never hangs. So maybe the scenarios that involve wake_impl() failing to
>> change the state to "waking" are legitimate cases that are correctly
>> handled, but I still do not understand how.
>>
>> So back to the drawing board. I still think we are missing some memory
>> barrier instructions somewhere. See below.
>>
>>>> On Sunday, April 4, 2021 at 12:32:04 PM UTC-4 Nadav Har'El wrote:
>>>>
>>>>> On Sun, Apr 4, 2021 at 6:00 PM Waldek Kozaczuk <[email protected]> wrote:
>>>>>
>>>>>> On Thursday, April 1, 2021 at 12:36:19 PM UTC-4 Nadav Har'El wrote:
>>>>>>
>>>>>>> On Fri, Mar 26, 2021 at 6:10 AM Waldek Kozaczuk <[email protected]> wrote:
>>>>>>>
>>>>>>>> As I have been researching the SMP issue described here -
>>>>>>>> https://github.com/cloudius-systems/osv/issues/1123 - I have
>>>>>>>> noticed that the 't' variable in
>>>>>>>> https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33
>>>>>>>> might be related to why OSv hangs with two or more CPUs.
>>>>>>>>
>>>>>>>> As I am describing here -
>>>>>>>> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270
>>>>>>>> - writing to the thread pointer variable by thread T1 on cpu0 may
>>>>>>>> not be visible to thread T2 on cpu1 in the weak-memory-model
>>>>>>>> scenario.
>>>>>>>
>>>>>>> This is an interesting question. Indeed, waiter::t is not marked
>>>>>>> atomic, and it's been a very long time since I wrote this code.
>>>>>>> Note that "t" is not public, and the only methods that access t are
>>>>>>> wake() and wait().
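[Editor's note: for readers unfamiliar with the structure under discussion, here is a minimal, simplified model of the wait()/wake() handshake around "t" - this is a hypothetical sketch, not OSv's actual wait_record; names like waiter_model and woken() are made up. Here t is std::atomic, so the cross-CPU visibility question raised in the thread does not arise in the model itself.]

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Simplified model: "t" identifies the waiting thread; wake() clears it
// (the "you may proceed" signal) and, in real code, would then wake the
// saved thread; wait() loops until it observes t == nullptr.
struct waiter_model {
    std::atomic<void*> t;
    explicit waiter_model(void* owner) : t(owner) {}

    void wake() {
        // Clear t with release ordering, then (in real code) wake the thread.
        t.store(nullptr, std::memory_order_release);
    }
    bool woken() const {
        return t.load(std::memory_order_acquire) == nullptr;
    }
    void wait() const {
        while (!woken()) {
            std::this_thread::yield();  // real code blocks in the scheduler
        }
    }
};
```

The whole debate in this thread is about what happens when t is a *plain* pointer instead and the ordering is supposed to come from the surrounding thread-state atomics.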
>>>>>>> I *think* (if I remember correctly) my thinking was that each of
>>>>>>> these wake() and wait() operations already has its *own* memory
>>>>>>> barriers: namely, wake_impl() and prepare_wait() both access the
>>>>>>> state of the thread with CST ordering -
>>>>>>> https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering
>>>>>>>
>>>>>>> I think (again, I hope I remember correctly) we were hoping that
>>>>>>> these barriers also apply to the t access before or after them.
>>>>>>>
>>>>>>> The thinking was that it is similar to how one uses a shared
>>>>>>> variable protected by a mutex: you don't need the shared variable
>>>>>>> to be std::atomic or care about memory order or barriers - the
>>>>>>> mutex already takes care of that for you.
>>>>>>
>>>>>> I think that is true on Intel but not necessarily on ARM.
>>>>>> Specifically, in the SMP case, changes made to that shared variable
>>>>>> on CPU 1, even within a mutex critical section, are not visible to
>>>>>> CPU 2, because the changes made by the former will still be in the
>>>>>> L1 cache. Only when we use proper memory barrier instructions (I
>>>>>> have figured out which ones to use and where exactly) OR
>>>>>> std::atomic<> (which on ARM translates to the *LDAR/STLR
>>>>>> instructions* that by nature make the changes to memory atomic and
>>>>>> visible to all CPUs) are the changes to _t visible.
>>>>>
>>>>> I don't think this is the case. Multi-threaded code which uses a
>>>>> traditional Posix mutex does *not* use any std::atomic or barriers.
>>>>> People wrote such code before C++11's std::atomic, and without
>>>>> barriers in assembly language (again, I'm talking about Posix threads
>>>>> code with mutexes - not about lockfree code).
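[Editor's note: the point above - that a plain, non-atomic shared variable is safe across CPUs as long as every access happens under the same mutex, because the mutex implementation performs the acquire/release barriers - can be illustrated with generic C++ (this is not OSv code):]

```cpp
#include <cassert>
#include <mutex>
#include <thread>

static std::mutex m;
static long shared_counter = 0;   // deliberately NOT std::atomic

static void add_many(int n) {
    for (int i = 0; i < n; i++) {
        std::lock_guard<std::mutex> g(m);  // lock() acts as an acquire barrier
        ++shared_counter;                  // plain read-modify-write is safe here
    }                                      // unlock() acts as a release barrier
}
```

Two threads running add_many() concurrently never lose an increment, even on a weakly ordered CPU, precisely because the barriers live inside the mutex, not in the user code.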
>>>>> The assumption is that the *implementation* of the mutex is using
>>>>> these barriers: "acquire" and "release" barriers, so that, for
>>>>> example, changes that one thread makes to a shared variable inside a
>>>>> critical section protected by a mutex will be visible to some other
>>>>> thread on another CPU which later takes the same mutex.
>>>>>
>>>>> The wait_record.hh wait()/wake() do NOT use a mutex (they are used in
>>>>> the mutex implementation ;-)), but they do have barriers, as I'll
>>>>> explain below:
>>>>>
>>>>>> Otherwise the wait() method in wait_record.hh will not see _t equal
>>>>>> to null when the thread waiting on CPU2 is woken. So the effect is
>>>>>> that the interrupt arrives and the thread is woken but does not see
>>>>>> a _t value equal to 0. But I might be wrong in my reasoning.
>>>>>>
>>>>>>> But you're right that it's hard to catch bugs in this memory
>>>>>>> ordering on x86, and it's easier to get it wrong on ARM.
>>>>>>>
>>>>>>> The question is what exactly is wrong:
>>>>>>>
>>>>>>> 1. Is the ordering in wake_impl() and prepare_wait() (called by
>>>>>>> waiter::wake() and waiter::wait()) not strong enough on ARM to
>>>>>>> serialize *everything*, not just a specific variable?
>>>>>>
>>>>>> I think this is specific to _t.
>>>>>>
>>>>>>> 2. Do we have this serialization in the right place? E.g., maybe we
>>>>>>> have the ordering point before writing t but we need it after
>>>>>>> writing t, and so on.
>>>>>>> 3. Maybe the problem isn't in waiter::wake() or waiter::wait() but
>>>>>>> in other functions like waiter::woken() or waiter::clear()?
>>>>>>>
>>>>>>> I suggest that you change waiter.hh to use std::atomic for t, and
>>>>>>> then use "relaxed" ordering to read/write t in some of the waiter
>>>>>>> methods and "cst" ordering in the others - and see which method
>>>>>>> needs the non-relaxed ordering.
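[Editor's note: a hypothetical sketch of the debugging experiment suggested above - make t a std::atomic whose accessors take an explicit ordering, so each waiter method can be relaxed one at a time to find which access actually needs the stronger ordering. The names waiter_probe, read() and clear() are made up for illustration; this is not OSv's waiter.hh.]

```cpp
#include <atomic>
#include <cassert>

struct thread_handle;  // opaque stand-in for sched::thread

struct waiter_probe {
    std::atomic<thread_handle*> t;
    explicit waiter_probe(thread_handle* th) : t(th) {}

    // Start with seq_cst everywhere; then relax one call site at a time
    // and re-run the failing workload to isolate the ordering-sensitive access.
    thread_handle* read(std::memory_order mo = std::memory_order_seq_cst) {
        return t.load(mo);
    }
    void clear(std::memory_order mo = std::memory_order_seq_cst) {
        t.store(nullptr, mo);
    }
};
```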
When we know if one specific method is >>>>>>> missing >>>>>>> the ordering, we can try to understand why it is missing. >>>>>>> >>>>>> Which specific place in code do you have in mind? Nothing in >>>>>> wait_record.hh uses memory ordering involved methods. >>>>>> >>>>> >>>>> *wait_record::wait()* calls sched::thread::wait_until(). This calls >>>>> do_wait_until() which starts creating a "wait_guard". This calls >>>>> thread::prepare_wait(), which calls >>>>> * _detached_state->st.store(status::waiting);* >>>>> because there is no memory-ordering parameter, it uses the default >>>>> memory_order_seq_cst >>>>> ordering. >>>>> The thinking was that that this does all the necessary memory barriers. >>>>> >>>>> In the other side, *wait_record::wake() *calls >>>>> thread::wake_with_from_mutex() which calls do_wake_with(), which does >>>>> *auto ds = _detached_state.get();* >>>>> this get() uses memory_order_seq_cst, and so it is the other side of >>>>> the store barrier we had in wait(). >>>>> >>>> Regarding your explanation about thread::prepare_wait() and do_wake_with(), the more I read about the acquire/release to make changes to some shared variable synced and visible between threads, should not this be reversed? If you look at do_wake_with(), it calls action() after "auto ds = _detached_state.get();" which is acquire operation, right? But normally the order is this: 1) some thread tried to acquire a lock calls *wait_record::wait()* which then effectively loops in thread::do_wait_until() until it see t null, right? 2) another thread calls *wait_record::wake() *which then ultimately set t to null and calls wake_impl() So given that shouldn't we have the acquire operation in the *wait_record::wait()* chain and release operation in the do_wake_with() after we call the action in wake_impl()? In theory, the prepare_wait has this line: assert(_detached_state->st.load() == status::running); - the acquire operation followed by the store which is release(). 
>>>> And wake_impl() has the loop calling st.compare_exchange_weak(),
>>>> which I think is a release operation if it succeeds. So maybe we are
>>>> OK, except when we break from that loop and return because the thread
>>>> state was neither waiting nor sending_lock. My atomic_thread_fence()
>>>> calls try to fill those missing gaps, I think.
>>>>
>>>>> Again, this was the thinking, or at least what I remember from many
>>>>> years ago when I was an expert in this code ;-)
>>>>> Maybe there is something subtly wrong with this thinking on ARM,
>>>>> while it was correct for x86....
>>
>> So you are right about these memory barriers in prepare_wait() and
>> do_wake_with(), but I may have found another, very related place where
>> we are missing them.
>>
>> In thread::do_wait_until() we do use a wait_guard that calls
>> prepare_wait(). However, prepare_wait() uses store(), which perhaps
>> only issues a *store* memory barrier instruction. And wait_until()
>> checks the shared variable - t in this case. For t changed by the
>> action in do_wake_with(), are we not missing a load barrier before
>> checking the predicate?
>>
>> Relatedly, as you can see above, wake_impl() may fail to change the st
>> variable to "waking" for the reasons described above. In that case,
>> perhaps the compare_exchange_weak() call that returns false does NOT
>> issue a store barrier, does it? So maybe in this case, before we return
>> from that while loop in wake_impl(), we should also issue a memory
>> barrier instruction?
>>
>> I conducted experiments where I added generic memory barrier
>> instructions - dmb ish and dmb ishst - to prepare_wait() and
>> wake_impl() in the places described, and I could see my hello world
>> execute 1-2K times before hanging in some other place. So not as good
>> as using an atomic for t, but these extra memory barriers definitely
>> help.
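[Editor's note: on the "does a failed compare_exchange_weak() issue a barrier?" question above - in C++, a failed CAS only performs a load, and the API in fact takes *two* ordering parameters, one for success and one for failure, so the failure path can legitimately be weaker. A generic illustration (hypothetical state enum, not OSv code):]

```cpp
#include <atomic>
#include <cassert>

enum class state : unsigned { running, waiting, waking };

static bool cas_to_waking(std::atomic<state>& st) {
    state expected = state::waiting;
    // Success ordering: acq_rel (a successful CAS is a read-modify-write,
    // so it can act as both acquire and release).
    // Failure ordering: acquire only - on failure nothing is stored, so no
    // release/store ordering takes effect, which is exactly the gap being
    // discussed for wake_impl()'s early-return path.
    return st.compare_exchange_weak(expected, state::waking,
                                    std::memory_order_acq_rel,
                                    std::memory_order_acquire);
}
```

Note compare_exchange_weak() may also fail spuriously even when the value matches, which is why it is always used in a loop.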
>> @@ -1143,6 +1164,7 @@ void thread::prepare_wait()
>>      preempt_disable();
>>      assert(_detached_state->st.load() == status::running);
>>      _detached_state->st.store(status::waiting);
>> +    asm volatile ("dmb ish");
>>  }
>
> One thing I missed is that prepare_wait() calls both load() and then
> store() on _detached_state->st, which is interesting, as it means that
> all memory operations - both loads and stores - should be ordered
> BEFORE and AFTER. This would indicate that 'asm volatile ("dmb ish");'
> (a general memory barrier instruction that waits for both store and
> load instructions to complete) is *NOT* necessary. But when I remove
> it, I get back the same hang failures.
>
> Maybe it is because, when I use 'asm volatile ("dmb ishst");' in
> wake_impl(), which is NOT specific to _detached_state->st, the rules of
> memory barriers do not apply? Meaning the load() has to match a store()
> of the *SAME* atomic variable, no?
>
> For example, this statement from the C++ memory-ordering documentation
> seems to indicate that:
>
> "memory_order_release - A store operation with this memory order
> performs the release operation: no reads or writes in the current
> thread can be reordered after this store. All writes in the current
> thread are visible in other threads that *acquire the same atomic
> variable* (see Release-Acquire ordering below) and writes that carry a
> dependency into the atomic variable become visible in other threads
> *that consume the same atomic* (see Release-Consume ordering below)."
>
> The more I dig into atomics and acquire-release semantics, the more I
> learn that indeed load()/store() (and their derivatives) need to be
> called against the same atomic variable.
>
> So maybe, because we use a non-atomic barrier instruction in
> wake_impl(), we have to use a similar one in prepare_wait() to make
> sure that changes to the shared variable *t* are visible between the
> two threads.
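[Editor's note: one nuance worth adding to the same-variable discussion above - std::atomic_thread_fence() also establishes acquire/release synchronization, but only when each side additionally performs *some* atomic operation connecting the two fences; a fence with no associated atomic access orders nothing by itself. A minimal sketch of fence-based pairing (generic C++, not OSv code; the flag variable is made up):]

```cpp
#include <atomic>
#include <cassert>
#include <thread>

static int data = 0;                        // plain shared variable, like "t"
static std::atomic<bool> flag{false};       // the atomic the fences pair through

static void producer() {
    data = 7;                               // plain write
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);   // fence upgrades this store
}

static void consumer() {
    while (!flag.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    assert(data == 7);  // visible via fence-fence synchronization
}
```

This is roughly what a raw "dmb ish" aims for on ARM, but expressed portably; crucially, the pairing still runs through an atomic object both threads touch.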
>> @@ -1192,6 +1214,13 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>>      trace_sched_wake(st->t);
>>      while (!st->st.compare_exchange_weak(old_status, status::waking)) {
>>          if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
>> +            asm volatile ("dmb ishst");
>>              return;
>>          }
>
> The C++ equivalent to the asm instructions would be this:
>
> @@ -1143,6 +1164,8 @@ void thread::prepare_wait()
>      preempt_disable();
>      assert(_detached_state->st.load() == status::running);
>      _detached_state->st.store(status::waiting);
> +    std::atomic_thread_fence(std::memory_order_acquire);
>  }
>
>  // This function is responsible for changing a thread's state from
> @@ -1192,6 +1215,15 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>      trace_sched_wake(st->t);
>      while (!st->st.compare_exchange_weak(old_status, status::waking)) {
>          if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
> +            std::atomic_thread_fence(std::memory_order_release);
>              return;
>          }
>      }
> @@ -1209,6 +1241,7 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/d85532fd-f013-4184-b41e-9ca7917d5e40n%40googlegroups.com.
