I think you are right about wait_record and memory barriers (unless there 
is still some hole in our thinking :-)).

So let us get back to one of the observations I made and the related 
question I posed here 
- https://github.com/cloudius-systems/osv/issues/1123#issuecomment-803710337. 

The thread::wake_impl() method has deliberate "if" logic that validates that 
the current (old) status of the thread is one of the initial states allowed 
by the *allowed_initial_states_mask* argument. If it is NOT, wake_impl() 
simply returns without actually waking the thread (without setting 
need_reschedule to true or calling send_wakeup_ipi()). It is interesting 
that neither wake_impl() nor wake() returns the result of this "waking 
process" - the return type is void. I wonder what the rationale behind that 
was. Maybe in most cases it does not matter whether we really woke the 
thread. But maybe in some cases it does, and we should be able to tell that 
the thread was not actually woken.
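For what it's worth, a result-returning variant is easy to imagine. This is 
only a sketch against a toy model - try_wake() and the status enum subset 
below are mine, not OSv's real code:

```cpp
#include <atomic>

// Toy subset of OSv's thread status states (illustrative, not the real enum).
enum class status : unsigned { waiting = 0, sending_lock, queued, waking, running };

// Hypothetical variant of wake_impl() that reports whether it actually
// transitioned the thread to "waking"; the real wake_impl() returns void.
bool try_wake(std::atomic<status>& st, unsigned allowed_initial_states_mask) {
    status old_status = status::waiting;
    while (!st.compare_exchange_weak(old_status, status::waking)) {
        if (!((1u << unsigned(old_status)) & allowed_initial_states_mask)) {
            return false;   // disallowed state: nothing was woken
        }
    }
    return true;            // CAS succeeded: the wakeup really happened
}
```

With that, a caller could distinguish "woke it" from "it was already 
queued/waking/running and I did nothing".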

Normally, the thread that wake_impl() is called on (per the st argument) 
would most likely be in the *waiting* state. But clearly that does not 
always have to be the case - the same thread could have just been woken by 
another thread on another CPU, for example. 

I have added these debug printouts to see what states a thread might be in 
when wake_impl() is called:

void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
{
    status old_status = status::waiting;
    trace_sched_wake(st->t);
    while (!st->st.compare_exchange_weak(old_status, status::waking)) {
        if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
            // --- debug printout added for this investigation ---
            if (allowed_initial_states_mask ==
                ((1 << unsigned(status::waiting)) |
                 (1 << unsigned(status::sending_lock)))) {
                if (old_status != status::waiting &&
                    old_status != status::sending_lock) {
                    debug_early_u64("! wake_impl: ", (unsigned)old_status);
                }
            }
            // ---------------------------------------------------
            return;
        }
    }
    ...

Please note I am specifically logging only cases when wake_impl() is called 
from thread::wake_with_from_mutex(Action action). With one CPU, I 
occasionally see "queued" logged. With 2 CPUs, besides "queued", I also 
occasionally see "waking" and "running", which is interesting. Most of the 
time, even when I see these "waking" and "running" printouts, OSv does NOT 
hang. But sometimes it hangs right after.

Now if we focus on the lock-free mutex (core/lfmutex.cc) and its two key 
methods - lock() and unlock() - I think we can imagine the following scenario:

1. Thread T1 on CPU0 calls lock() on some lock-free mutex M, fails to 
acquire the lock, and ends up creating a wait_record "waiter" on the stack. 
It adds "waiter" to M's waitqueue and finally calls waiter.wait(), which 
ends up calling sched::do_wait_until(). Eventually T1 gets scheduled out 
and becomes "waiting".
2. Later, thread T2 on CPU0 calls thread::wake() - or some other code path 
unrelated to the lock-free mutex calls wake_impl() - to wake thread T1 (I 
am not sure what such a scenario might be).
3a. Thread T1 becomes running again on CPU0, but it checks the waiter's 
condition (!t), which obviously is NOT true, and it goes back to "waiting".
3b. Almost at the same time as T1 is running per 3a on CPU0, thread T3 on 
CPU1 calls unlock() on the same mutex M, pops the waiter from step 1 as 
part of the "Responsibility Hand-Off" protocol, and calls wake() (assuming 
it will simply wake the thread). But in the end, the "if" in wake_impl() 
determines that T1 is running, so it never actually wakes it, and T1 stays 
stuck like this (no?) unless some other thread later simply wakes it again 
(t was set to null in the waiter). But what if such a thread never comes to 
wake it?
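To make the window in 3a/3b concrete, here is a toy model of the mechanism 
I am describing. Everything below (thread_model, wait_record_model, this 
wake()) is a simplified stand-in I made up; OSv's real wait_record and 
wake_impl() differ:

```cpp
#include <atomic>
#include <cstddef>

enum class status : unsigned { waiting = 0, waking, running };

struct thread_model {
    std::atomic<status> st{status::waiting};
};

// Simplified wait_record: t == nullptr means "woken".
struct wait_record_model {
    std::atomic<thread_model*> t;
    explicit wait_record_model(thread_model* owner) : t(owner) {}
    bool woken() const { return t.load() == nullptr; }
};

// T3's unlock() path in step 3b: clear t, then try the waiting -> waking
// transition. Returns whether the transition happened (void in real OSv).
bool wake(wait_record_model& wr) {
    thread_model* th = wr.t.exchange(nullptr);  // hand-off: consume the waiter
    if (!th) {
        return false;
    }
    status expected = status::waiting;
    // If T1 is momentarily "running" (step 3a), this CAS fails and the
    // wakeup is silently dropped even though t was already cleared.
    return th->st.compare_exchange_strong(expected, status::waking);
}
```

The point is that the waiter is consumed (t cleared) regardless of whether 
the state transition succeeded, and the caller has no way to notice.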

Am I misunderstanding something or describing a scenario that should never 
happen? Or maybe it is a symptom of another problem?

If my thinking is correct, then:
1. Shouldn't we check in the lock-free mutex code whether wake_impl() 
actually worked all the way, and push the waiter back onto its waitqueue if 
it did not? Or is it even more complicated, and should we iterate to find a 
waiter it can actually wake?
2. Shouldn't the implementation of thread::do_wake_with(Action action, 
unsigned allowed_initial_states_mask) change to call action ONLY if 
wake_impl() actually woke the thread (meaning the while loop succeeded)? We 
would need another wake_impl() that would take action as an argument.
 
If this is indeed a real problem, why does it not happen on Intel?

On Sunday, April 4, 2021 at 12:32:04 PM UTC-4 Nadav Har'El wrote:

> On Sun, Apr 4, 2021 at 6:00 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
>
>>
>>
>> On Thursday, April 1, 2021 at 12:36:19 PM UTC-4 Nadav Har'El wrote:
>>
>>> On Fri, Mar 26, 2021 at 6:10 AM Waldek Kozaczuk <jwkoz...@gmail.com> 
>>> wrote:
>>>
>>>> As I have been researching a bit the SMP issue described here - 
>>>> https://github.com/cloudius-systems/osv/issues/1123 - I have noticed 
>>>> that the 't' variable in 
>>>> https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33
>>>>  
>>>> might be related to why OSv hangs with two or more CPUs.
>>>>
>>>> As I am describing here - 
>>>> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270 
>>>> - writing to the thread pointer variable by thread T1 on cpu0, may not be 
>>>> visible to thread T2 on cpu1 in the weak-memory model scenario. 
>>>>
>>>
>>> This is an interesting question. Indeed, waiter::t is not marked 
>>> atomic, and it's been a very long time since I wrote this code. Note that 
>>> "t" is not public, and the only methods that access t are wake() and wait().
>>>
>>> I *think* (if I remember correctly) my thinking was that each of these 
>>> wake() and wait() operations already has its *own* memory barriers: namely, 
>>> wake_impl() and prepare_wait() both access the state of the thread with CST 
>>> ordering 
>>> (https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering).
>>> I think (again, I hope I remember correctly) we were hoping that these 
>>> barriers also apply to the t accesses before or after them. 
>>>
>>> The thinking was that it is similar to how one uses a shared variable 
>>> protected by the mutex: You don't need the shared variable to be 
>>> std::atomic or care about memory order or barriers - the mutex already 
>>> takes care of that for you.
>>>
>>
>> I think that is true on Intel but not necessarily on ARM. Specifically, 
>> in the SMP case, the changes made to that shared variable on CPU 1, even 
>> within a mutex critical section, are not visible to CPU 2 because the 
>> changes made by the former will still be in the L1 cache. Only when we use 
>> proper memory barrier instructions (I have figured out which ones to use 
>> and where exactly) OR std::atomic<> (which on ARM translates to the 
>> *LDAR/STLR instructions* that by nature make the changes to memory atomic 
>> and visible to all CPUs) are the changes to _t visible.
>>
>
> I don't think this is the case. Multi-threaded code which uses traditional 
> Posix mutex does *not* use any std::atomic or barriers. People wrote such 
> code before C++11's std::atomic, and without barriers in assembly language 
> (again, I'm talking about Posix threads code with mutexes - not about 
> lockfree code).
> The assumption is that the *implementation* of the mutex is using these 
> barriers: "acquire" and "release" barriers so that for example changes that 
> one thread does to a shared variable inside a critical section protected by 
> a mutex will be visible by some other thread on another CPU which later 
> takes the same mutex.
>  
> The wait_record.hh wait()/wake() do NOT use a mutex (they are used in the 
> mutex implementation ;-)) but they do have barriers as I'll explain below:
>
>
>> Otherwise the wait() method in wait_record.hh will not see _t equal to 
>> null when the thread waiting on CPU2 is woken. So the effect is that the 
>> interrupt arrives, thread is woken but does not see _t value equal to 0. 
>> But I might be wrong in my reasoning.
>>
>>
>>> But you're right that it's hard to catch bugs in this memory ordering on 
>>> x86 and it's easier to do it wrong on ARM.
>>>
>>> The question is what exactly is wrong:
>>>
>>>    1. Is the ordering in wake_impl() and prepare_wait() (called by 
>>>    waiter::wake() and waiter::wait()) not strong enough on ARM to serialize 
>>>    *everything*, not just a specific variable?
>>>
>>> I think this is specific to _t. 
>>
>>>
>>>    2. Do we have this serialization in the right place? E.g., maybe we 
>>>    have the ordering point before writing t but we need it after writing t, 
>>>    and so on.
>>>    3. Maybe the problem isn't in waiter::wake() or waiter::wait() but 
>>>    in other functions like waiter::woken(), waiter::clear()?
>>>
>>> I suggest that you change waiter.hh to use std::atomic for t, and then 
>>> using "relaxed" ordering to read/write t for some of the waiter methods and 
>>> "cst" ordering for other methods - and see which method needs the 
>>> non-relaxed ordering. When we know which specific method is missing the 
>>> ordering, we can try to understand why it is missing.
>>>
>> Which specific place in the code do you have in mind? Nothing in 
>> wait_record.hh uses any memory-ordering-related methods.
>>
>
> *wait_record::wait()* calls sched::thread::wait_until(). This calls 
> do_wait_until() which starts creating a "wait_guard".  This calls 
> thread::prepare_wait(), which calls
>    * _detached_state->st.store(status::waiting);*
> because there is no memory-ordering parameter, it uses the default 
> memory_order_seq_cst 
> ordering.
> The thinking was that this does all the necessary memory barriers.
>
> On the other side, *wait_record::wake()* calls 
> thread::wake_with_from_mutex() which calls do_wake_with(), which does
>         *auto ds = _detached_state.get();*
> this get() uses memory_order_seq_cst, and so it is the other side of the 
> store barrier we had in wait(). 
>
> Again, this was the thinking, or at least what I remember from many years 
> ago when I was an expert in this code ;-)
> Maybe there is something subtly wrong with this thinking on ARM, while it 
> was correct for x86....
>
>
