On Wed, Apr 7, 2021 at 10:56 AM Waldek Kozaczuk <[email protected]> wrote:
> On Wed, Apr 7, 2021 at 01:36 Waldek Kozaczuk <[email protected]> wrote:
>
>> I think you are right about wait_record and the memory barriers (unless
>> there is still some hole in our thinking :-)).
>>
>> So let us get back to one of the observations I made and the related
>> question I posed here -
>> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-803710337.
>>
>> The thread::wake_impl() method has deliberate "if" logic to validate
>> that the current (old) status of the thread is one of the initial states
>> allowed by the *allowed_initial_states_mask* argument. If it is NOT,
>> wake_impl() simply returns without actually waking the thread (setting
>> need_reschedule to true or calling send_wakeup_ipi()). It is interesting
>> that neither wake_impl() nor wake() returns the result of the "waking
>> process" - the return type is void. I wonder what the rationale behind
>> that was. Maybe in most cases it does not matter whether we really wake
>> the thread or not. But maybe in some it does, and we should be able to
>> know that the thread was not really woken.
>>
>> Normally the thread that wake_impl() is called on (per the st argument)
>> would most likely be in the *waiting* state. But clearly it does not
>> have to be the case always - the same thread could have just been woken
>> by another thread on another CPU, for example.
>>
>> I have added these debug printouts to see what states a thread might be
>> in when wake_impl() is called:
>>
>> void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
>> {
>>     status old_status = status::waiting;
>>     trace_sched_wake(st->t);
>>     while (!st->st.compare_exchange_weak(old_status, status::waking)) {
>>         if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
>>             *if (allowed_initial_states_mask == ((1 << unsigned(status::waiting)) | (1 << unsigned(status::sending_lock)))) {*
>>                 *if (old_status != status::waiting && old_status != status::sending_lock) {*
>>                     *debug_early_u64("! wake_impl: ", (unsigned)old_status);*
>>                 *}*
>>             *}*
>>             return;
>>         }
>>     }
>> }
>>
>> Please note I am specifically logging only cases when wake_impl() is
>> called from thread::wake_with_from_mutex(Action action). With one CPU,
>> I occasionally see "queued" logged. With 2 CPUs, besides "queued", I
>> also occasionally see "waking" and "running", which is interesting.
>> Most of the time, even if I see these "waking" and "running" printouts,
>> OSv would NOT hang. But sometimes it would, right after.
>
> I think that "queued" is not an issue, as this thread is in the runqueue
> and will eventually be running. With a single CPU the "running" case is
> impossible (only one thread can be running at a time) and "waking" is
> also OK.
>
> I wonder if step 2 below is ever possible, or should be.
>
> Or maybe we have some memory ordering issue with the waitlist, where
> popping a waiter is not visible on the other CPU and it is still in that
> list, or maybe we have duplicates of the same thread.
>
>> Now if we focus on the lockless mutex (core/lfmutex.cc) and its two key
>> methods - lock() and unlock() - I think we can imagine the following
>> scenario:
>>
>> 1.
>> Thread T1 on CPU0 calls lock() on some lock-free mutex M and ends up
>> creating a wait_record "waiter" on the stack (it could NOT acquire the
>> lock), adds "waiter" to M's waitqueue, and finally calls waiter.wait(),
>> which ends up calling sched::do_wait_until(). Eventually T1 gets
>> scheduled out and becomes "waiting".
>>
>> 2. Later, thread T2 on CPU0 calls thread::wake(), or some other place
>> in the code not related to the lock-free mutex calls wake_impl(), to
>> wake thread T1 (I am not sure what such a scenario might be).
>>
>> 3a. Thread T1 becomes running again on CPU0, but it checks the
>> condition of the waiter, which obviously is NOT true (!t), and it goes
>> back to "waiting".
>>
>> 3b. Almost at the same time as T1 is running per 3a on CPU0, thread T3
>> on CPU1 calls unlock() on the same mutex M, pops the waiter from 1) as
>> part of the "Responsibility Hand-Off" protocol, and calls wake()
>> (assuming it will simply wake the thread). But in the end, the
>> wake_impl() "if" determines that T1 is running, so it actually never
>> wakes it, and T1 stays stuck like this (no?) unless some other thread
>> later simply wakes it again (t was set to null in the waiter). But what
>> if such a thread never comes to wake it?
>>
>> Am I misunderstanding something, or describing a scenario that should
>> never happen? Or maybe it is a symptom of another problem?
>>
>> If my thinking is correct, then:
>>
>> 1. Shouldn't we check in the lock-free mutex code whether wake_impl()
>> actually worked all the way, and push the waiter back onto its
>> waitqueue if it did not? Or is it even more complicated, and should we
>> iterate to find a waiter it can actually wake?
>>
>> 2. Shouldn't the implementation of thread::do_wake_with(Action action,
>> unsigned allowed_initial_states_mask) change to call action ONLY if
>> wake_impl() actually woke the thread (meaning the while loop
>> succeeded)? We would need another wake_impl() that would take action as
>> an argument.
>>
>> If that is indeed a real problem, why does it not happen on Intel?

I have applied the same patch and ran the same tests, and I could see the
same occasional "waking" and "running" printouts. But on x64 it never
hangs. So maybe the scenarios that involve wake_impl() failing to change
the state to waking are legitimate cases that are handled correctly, but
I still do not understand how. So, back to the board. I still think we
are missing some memory barrier instructions somewhere. See below.

>> On Sunday, April 4, 2021 at 12:32:04 PM UTC-4 Nadav Har'El wrote:
>>
>>> On Sun, Apr 4, 2021 at 6:00 PM Waldek Kozaczuk <[email protected]>
>>> wrote:
>>>
>>>> On Thursday, April 1, 2021 at 12:36:19 PM UTC-4 Nadav Har'El wrote:
>>>>
>>>>> On Fri, Mar 26, 2021 at 6:10 AM Waldek Kozaczuk
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> As I have been researching a bit the SMP issue described here -
>>>>>> https://github.com/cloudius-systems/osv/issues/1123 - I have
>>>>>> noticed that the 't' variable in
>>>>>> https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33
>>>>>> might be related to why OSv hangs with two or more CPUs.
>>>>>>
>>>>>> As I am describing here -
>>>>>> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270
>>>>>> - writing to the thread pointer variable by thread T1 on cpu0 may
>>>>>> not be visible to thread T2 on cpu1 in a weak-memory-model
>>>>>> scenario.
>>>>>
>>>>> This is an interesting question. Indeed, waiter::t is not marked
>>>>> atomic, and it's been a very long time since I wrote this code. Note
>>>>> that "t" is not public, and the only methods that access t are
>>>>> wake() and wait().
>>>>>
>>>>> I *think* (if I remember correctly) my thinking was that each of
>>>>> these wake() and wait() operations already has its *own* memory
>>>>> barriers.
>>>>> Namely, wake_impl() and prepare_wait() both access the state of the
>>>>> thread with CST ordering (
>>>>> https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering
>>>>> ). I think (again, I hope I remember correctly) we were hoping that
>>>>> these barriers also apply to the t accesses before or after them.
>>>>>
>>>>> The thinking was that it is similar to how one uses a shared
>>>>> variable protected by a mutex: you don't need the shared variable to
>>>>> be std::atomic or to care about memory order or barriers - the mutex
>>>>> already takes care of that for you.
>>>>
>>>> I think that is true on Intel but not necessarily on ARM.
>>>> Specifically, in the SMP case, the changes made to that shared
>>>> variable on CPU 1, even within a mutex critical section, are not
>>>> visible to CPU 2, because the changes made by the former will still
>>>> be in the L1 cache. Only when we use proper memory barrier
>>>> instructions (I have figured out which ones to use and where exactly)
>>>> OR std::atomic<> (which on ARM translates to the *LDAR/STLR
>>>> instructions* that by nature make the changes to memory atomic and
>>>> visible to all CPUs) are the changes to _t visible.
>>>
>>> I don't think this is the case. Multi-threaded code which uses a
>>> traditional POSIX mutex does *not* use any std::atomic or barriers.
>>> People wrote such code before C++11's std::atomic, and without
>>> barriers in assembly language (again, I'm talking about POSIX threads
>>> code with mutexes - not about lock-free code).
>>> The assumption is that the *implementation* of the mutex uses these
>>> barriers: "acquire" and "release" barriers, so that, for example,
>>> changes that one thread makes to a shared variable inside a critical
>>> section protected by a mutex will be visible to another thread on
>>> another CPU which later takes the same mutex.
>>>
>>> The wait_record.hh wait()/wake() do NOT use a mutex (they are used in
>>> the mutex implementation ;-)), but they do have barriers, as I'll
>>> explain below:
>>>
>>>> Otherwise the wait() method in wait_record.hh will not see _t equal
>>>> to null when the thread waiting on CPU2 is woken. So the effect is
>>>> that the interrupt arrives and the thread is woken, but it does not
>>>> see the _t value equal to 0. But I might be wrong in my reasoning.
>>>>
>>>>> But you're right that it's hard to catch bugs in this memory
>>>>> ordering on x86, and it's easier to get it wrong on ARM.
>>>>>
>>>>> The question is what exactly is wrong:
>>>>>
>>>>> 1. Is the ordering in wake_impl() and prepare_wait() (called by
>>>>> waiter::wake() and waiter::wait()) not strong enough on ARM to
>>>>> serialize *everything*, not just a specific variable?
>>>>
>>>> I think this is specific to _t.
>>>>
>>>>> 2. Do we have this serialization in the right place? E.g., maybe we
>>>>> have the ordering point before writing t but we need it after
>>>>> writing t, and so on.
>>>>>
>>>>> 3. Maybe the problem isn't in waiter::wake() or waiter::wait() but
>>>>> in other functions like waiter::woken() or waiter::clear()?
>>>>>
>>>>> I suggest that you change wait_record.hh to use std::atomic for t,
>>>>> and then use "relaxed" ordering to read/write t in some of the
>>>>> waiter methods and "cst" ordering in the others - and see which
>>>>> method needs the non-relaxed ordering. When we know which specific
>>>>> method is missing the ordering, we can try to understand why it is
>>>>> missing.
>>>>
>>>> Which specific place in the code do you have in mind? Nothing in
>>>> wait_record.hh uses memory-ordering-related methods.
>>>
>>> *wait_record::wait()* calls sched::thread::wait_until(). This calls
>>> do_wait_until(), which starts by creating a "wait_guard".
>>> This calls thread::prepare_wait(), which calls
>>> *_detached_state->st.store(status::waiting);*
>>> Because there is no memory-ordering parameter, it uses the default
>>> memory_order_seq_cst ordering. The thinking was that this does all the
>>> necessary memory barriers.
>>>
>>> On the other side, *wait_record::wake()* calls
>>> thread::wake_with_from_mutex(), which calls do_wake_with(), which does
>>> *auto ds = _detached_state.get();*
>>> This get() uses memory_order_seq_cst, and so it is the other side of
>>> the store barrier we had in wait().
>>>
>>> Again, this was the thinking, or at least what I remember from many
>>> years ago when I was an expert in this code ;-)
>>> Maybe there is something subtly wrong with this thinking on ARM, while
>>> it was correct for x86....

So you are right about these memory barriers in prepare_wait() and
do_wake_with(), but I may have found another very related place where we
are missing some. In thread::do_wait_until() we do use a wait_guard that
calls prepare_wait(). However, prepare_wait() uses store(), which perhaps
only issues a *store* memory barrier instruction, and wait_until() then
checks the shared variable - t, in this case. For t changed by the action
in do_wake_with(), are we not missing a load barrier before checking the
predicate?

Relatedly, as you can see above, wake_impl() may fail to change the st
variable to waking, for the reasons described above. In that case, the
compare_exchange_weak() call that returns false does NOT issue a store
barrier, does it? So maybe in this case, before we return from that while
loop in wake_impl(), we should also issue a memory barrier instruction?

I conducted experiments where I added a generic memory barrier
instruction - dmb ishst - to both prepare_wait() and wake_impl() in the
places described, and I could see my hello world execute 1-2K times
before hanging in some other place. So not as good as using atomic for t,
but these extra memory barriers definitely help.
@@ -1143,6 +1164,7 @@ void thread::prepare_wait()
     preempt_disable();
     assert(_detached_state->st.load() == status::running);
     _detached_state->st.store(status::waiting);
+    asm volatile ("dmb ishst");
 }

@@ -1192,6 +1214,13 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
     trace_sched_wake(st->t);
     while (!st->st.compare_exchange_weak(old_status, status::waking)) {
         if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
+            asm volatile ("dmb ishst");
             return;
         }

--
You received this message because you are subscribed to the Google Groups
"OSv Development" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/osv-dev/CAL9cFfMPcXuSPOSZ51EeWBmeQtUHbk2PU7MQk%3Dgwmc5i1mzJpA%40mail.gmail.com.
