Re: Mutex oddities with v1.0.0

marko kiiskila Mon, 06 Nov 2017 19:18:45 -0800

> On Nov 6, 2017, at 5:23 PM, Jitesh Shah <[email protected]> wrote:
> 
> Alright,
> I think I have an idea of whats going on here.
> 
> So the structure of my code is like this ->
> os_mutex_pend();
> .. trigger operation here ..
> os_sem_pend();  // Wait for operation to complete. ISR calls
> os_sem_release()
> os_mutex_release();
> 
> First of all following variables from task object are shared between mutex
> and semaphore implementations: t_obj, t_flags, t_lockcnt, t_prio. This
> basically guarantees that one *cannot* nest mutex and semaphore operations.
> Overwriting t_obj on sem_pend() basically guarantees that current task will
> never be off the mutex pending list.


t_obj is only used when waiting for lock, to link the tasks which are waiting
for the lock. Not after lock has been obtained.
Task can only sleep on one lock at a time; it’s not possible to wait for 2 
simultaneously.
So semaphore and mutex using the same t_obj while waiting is ok; preferable 
even.

You can mix using semaphores and mutex.

> 
> Secondly, I have some fundamental doubts about the semaphore implementation:
> 1) os_sem_release() code has a snippet like so:
> 
>>    if (--current->t_lockcnt == 0) {
>>        current->t_flags &= ~OS_TASK_FLAG_LOCK_HELD;
>>    }
> 
> For semaphores, (as opposed to mutexes) the task which is pending on
> semaphore might NOT be the task that releases the semaphore. Thus, reducing
> the "lockcnt" or adjusting the flags field of the "current" task inside
> os_sem_release() doesn't make sense. What are your thoughts on this?

You’re right; the bookkeeping for the flag OS_TASK_FLAG_LOCK_HELD is incorrect.
However, this is only checked when os_task_remove() is called. It was added to
safeguard user from deleting a task which is still holding locks. Given the 
nature of
semaphores, we probably cannot do this for that style lock. I think we should 
remove that
from os_sem_XXX() calls.

> 
> Jitesh
> 
> On Mon, Nov 6, 2017 at 4:18 PM, marko kiiskila <[email protected]> wrote:
> 
>> I’d start by looking for memory corruption. You could try adding
>> guard variables around your send_mutex(), and see if anything
>> stomps on them. Another option could be to change mutex_release() to
>> write something other than 0 at mu_owner, and then add a conditional
>> hardware watchpoint which looks if anything tries to zero out mu_owner
>> of your send_mutex.
>> 
>> Good luck with the hunt.
>> 
>>> On Nov 6, 2017, at 3:39 PM, will sanfilippo <[email protected]> wrote:
>>> 
>>> Yeah, Chris is right here. I did not read the email thoroughly enough
>> and if what I described happened, the owner would not be NULL. Sorry about
>> that.
>>> 
>>> So while it would explain lockcnt and level, it would not explain why
>> the owner is NULL, as failing to release the mutex would have the owner set
>> to something.
>>> 
>>> 
>>> 
>>>> On Nov 6, 2017, at 3:33 PM, Christopher Collins <[email protected]>
>> wrote:
>>>> 
>>>> I agree that a mutex should never have a null owner and a nonzero level.
>>>> 
>>>> Unfortunately, my first guess is some form of memory corruption:
>>>> it seems like a null value accidentally got written to `mu_owner`.  I
>>>> could be missing it, but I don't see any logic error in the mutex code
>>>> which could cause this.
>>>> 
>>>> Getting to the bottom of this is probably going to be difficult,
>>>> especially if it is not easy to reproduce.  I don't know how valuable
>>>> they are, but my two suggestions are:
>>>> 
>>>> 1. Look at the `.lst` file that newt generates during a build to
>>>> determine what object immediately follows the mutex in RAM.  Maybe an
>>>> errant write intended for this object is clearing the owner field.
>>>> 
>>>> 2. Instrument the code with a bunch of asserts and logs.  Maybe you can
>>>> catch the problem shortly after it happens.
>>>> 
>>>> Like I said, probably not the most helpful advice, but I don't think
>>>> this is going to be an easy one to solve!
>>>> 
>>>> Chris
>>>> 
>>>> On Mon, Nov 06, 2017 at 03:16:06PM -0800, Jitesh Shah wrote:
>>>>> Hey wil,
>>>>> Are you saying that because "mu_level" is set to 1?
>>>>> 
>>>>> It is set to 1 because the last call to os_mutex_release() failed on
>>>>> account of "mu_owner" not matching. Thus, the task that got the mutex
>>>>> failed to release it. That explains t_lockcnt and mu_level, right?
>>>>> 
>>>>> Jitesh
>>>>> 
>>>>> On Mon, Nov 6, 2017 at 7:56 AM, will sanfilippo <[email protected]>
>> wrote:
>>>>> 
>>>>>> What this looks like to me is that there was a nested pend without the
>>>>>> same number of releases. Maybe some path out of some code that is
>> rarely
>>>>>> hit where a mutex is granted but not released?
>>>>>> 
>>>>>> Just a guess...
>>>>>> 
>>>>>>> On Nov 5, 2017, at 8:26 PM, Jitesh Shah <[email protected]>
>> wrote:
>>>>>>> 
>>>>>>> Hey Guys,
>>>>>>> I am running v1.0.0 branch (0db6321a75deda126943aa187842da
>> 6b977cd1c1).
>>>>>>> Seeing some strange mutex behaviour.
>>>>>>> 
>>>>>>> So once in a bazillion times, a mutex fails to release. Here is how
>> the
>>>>>>> structure looks like when it fails:
>>>>>>> 
>>>>>>>> (gdb) p/x send_mutex
>>>>>>>> $1 = {mu_head = {slh_first = 0x0}, _pad = 0x0, mu_prio = 0x1,
>> mu_level =
>>>>>>>> 0x1, mu_owner = 0x0}
>>>>>>> 
>>>>>>> 
>>>>>>> Why is mu_owner set to 0? That causes the os_mutex_release call to
>> fail
>>>>>>> since the current task doesn't match the owner task anymore.
>>>>>>> 
>>>>>>> The task which holds the mutex looks like this:
>>>>>>> 
>>>>>>>> (gdb) p/x cent_task
>>>>>>>> $3 = {t_stackptr = 0x20008a28, t_stacktop = 0x20008ac8, t_stacksize
>> =
>>>>>>>> 0x80, t_taskid = 0x6, t_prio = 0x1, t_state = 0x1, t_flags = 0x10,
>>>>>>>> t_lockcnt = 0x1, t_pad = 0x0,
>>>>>>>> t_name = 0x22378, t_func = 0x90ad, t_arg = 0x0, t_obj = 0x0,
>>>>>>>> t_sanity_check = {sc_checkin_last = 0x0, sc_checkin_itvl = 0x0,
>> sc_func
>>>>>> =
>>>>>>>> 0x0, sc_arg = 0x0, sc_next = {
>>>>>>>>   sle_next = 0x0}}, t_next_wakeup = 0x0, t_run_time = 0x0,
>>>>>>>> t_ctx_sw_cnt = 0x213d, t_os_task_list = {stqe_next = 0x0},
>> t_os_list =
>>>>>>>> {tqe_next = 0x20001338,
>>>>>>>> tqe_prev = 0x200001a8}, t_obj_list = {sle_next = 0x0}}
>>>>>>> 
>>>>>>> 
>>>>>>> Comparing t_prio and mu_prio, this confirms that this task is indeed
>>>>>>> holding the mutex (no other task is waiting on the mutex).
>>>>>>> 
>>>>>>> What can happen that set mu_owner to 0? My original theory was that
>> if a
>>>>>>> mutex_pend was called from an interrupt context, mu_owner would be
>> 0. But
>>>>>>> in this case, the only task that is calling mutex is running an
>> eventq,
>>>>>> so
>>>>>>> that is unlikely.
>>>>>>> 
>>>>>>> Any ideas?
>>>>>>> 
>>>>>>> Jitesh
>>>>>>> 
>>>>>>> --
>>>>>>> This email including attachments contains Mad Apparel, Inc. DBA Athos
>>>>>>> privileged, confidential, and proprietary information solely for the
>> use
>>>>>>> for the addressed recipients. If you are not the intended recipient,
>>>>>> please
>>>>>>> be aware that any review, disclosure, copying, distribution, or use
>> of
>>>>>> the
>>>>>>> contents of this message is strictly prohibited. If you have received
>>>>>> this
>>>>>>> in error, please delete it immediately and notify the sender. All
>> rights
>>>>>>> reserved by Mad Apparel, Inc. 2012. The information contained herein
>> is
>>>>>> the
>>>>>>> exclusive property of Mad Apparel, Inc. and should not be used,
>>>>>>> distributed, reproduced, or disclosed in whole or in part without
>> prior
>>>>>>> written permission of Mad Apparel, Inc.
>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> This email including attachments contains Mad Apparel, Inc. DBA Athos
>>>>> privileged, confidential, and proprietary information solely for the
>> use
>>>>> for the addressed recipients. If you are not the intended recipient,
>> please
>>>>> be aware that any review, disclosure, copying, distribution, or use of
>> the
>>>>> contents of this message is strictly prohibited. If you have received
>> this
>>>>> in error, please delete it immediately and notify the sender. All
>> rights
>>>>> reserved by Mad Apparel, Inc. 2012. The information contained herein
>> is the
>>>>> exclusive property of Mad Apparel, Inc. and should not be used,
>>>>> distributed, reproduced, or disclosed in whole or in part without prior
>>>>> written permission of Mad Apparel, Inc.
>>> 
>> 
>> 
> 
> -- 
> This email including attachments contains Mad Apparel, Inc. DBA Athos 
> privileged, confidential, and proprietary information solely for the use 
> for the addressed recipients. If you are not the intended recipient, please 
> be aware that any review, disclosure, copying, distribution, or use of the 
> contents of this message is strictly prohibited. If you have received this 
> in error, please delete it immediately and notify the sender. All rights 
> reserved by Mad Apparel, Inc. 2012. The information contained herein is the 
> exclusive property of Mad Apparel, Inc. and should not be used, 
> distributed, reproduced, or disclosed in whole or in part without prior 
> written permission of Mad Apparel, Inc.

Re: Mutex oddities with v1.0.0

Reply via email to