I’d start by looking for memory corruption. You could try adding
guard variables around your send_mutex(), and see if anything
stomps on them. Another option could be to change mutex_release() to
write something other than 0 at mu_owner, and then add a conditional
hardware watchpoint which looks if anything tries to zero out mu_owner
of your send_mutex.

Good luck with the hunt.

> On Nov 6, 2017, at 3:39 PM, will sanfilippo <[email protected]> wrote:
> 
> Yeah, Chris is right here. I did not read the email thoroughly enough and if 
> what I described happened, the owner would not be NULL. Sorry about that.
> 
> So while it would explain lockcnt and level, it would not explain why the 
> owner is NULL, as failing to release the mutex would have the owner set to 
> something.
> 
> 
> 
>> On Nov 6, 2017, at 3:33 PM, Christopher Collins <[email protected]> wrote:
>> 
>> I agree that a mutex should never have a null owner and a nonzero level.
>> 
>> Unfortunately, my first guess is some form of memory corruption:
>> it seems like a null value accidentally got written to `mu_owner`.  I
>> could be missing it, but I don't see any logic error in the mutex code
>> which could cause this.
>> 
>> Getting to the bottom of this is probably going to be difficult,
>> especially if it is not easy to reproduce.  I don't know how valuable
>> they are, but my two suggestions are:
>> 
>> 1. Look at the `.lst` file that newt generates during a build to
>> determine what object immediately follows the mutex in RAM.  Maybe an
>> errant write intended for this object is clearing the owner field.
>> 
>> 2. Instrument the code with a bunch of asserts and logs.  Maybe you can
>> catch the problem shortly after it happens.
>> 
>> Like I said, probably not the most helpful advice, but I don't think
>> this is going to be an easy one to solve!
>> 
>> Chris
>> 
>> On Mon, Nov 06, 2017 at 03:16:06PM -0800, Jitesh Shah wrote:
>>> Hey wil,
>>> Are you saying that because "mu_level" is set to 1?
>>> 
>>> It is set to 1 because the last call to os_mutex_release() failed on
>>> account of "mu_owner" not matching. Thus, the task that got the mutex
>>> failed to release it. That explains t_lockcnt and mu_level, right?
>>> 
>>> Jitesh
>>> 
>>> On Mon, Nov 6, 2017 at 7:56 AM, will sanfilippo <[email protected]> wrote:
>>> 
>>>> What this looks like to me is that there was a nested pend without the
>>>> same number of releases. Maybe some path out of some code that is rarely
>>>> hit where a mutex is granted but not released?
>>>> 
>>>> Just a guess...
>>>> 
>>>>> On Nov 5, 2017, at 8:26 PM, Jitesh Shah <[email protected]> wrote:
>>>>> 
>>>>> Hey Guys,
>>>>> I am running v1.0.0 branch (0db6321a75deda126943aa187842da6b977cd1c1).
>>>>> Seeing some strange mutex behaviour.
>>>>> 
>>>>> So once in a bazillion times, a mutex fails to release. Here is how the
>>>>> structure looks like when it fails:
>>>>> 
>>>>>> (gdb) p/x send_mutex
>>>>>> $1 = {mu_head = {slh_first = 0x0}, _pad = 0x0, mu_prio = 0x1, mu_level =
>>>>>> 0x1, mu_owner = 0x0}
>>>>> 
>>>>> 
>>>>> Why is mu_owner set to 0? That causes the os_mutex_release call to fail
>>>>> since the current task doesn't match the owner task anymore.
>>>>> 
>>>>> The task which holds the mutex looks like this:
>>>>> 
>>>>>> (gdb) p/x cent_task
>>>>>> $3 = {t_stackptr = 0x20008a28, t_stacktop = 0x20008ac8, t_stacksize =
>>>>>> 0x80, t_taskid = 0x6, t_prio = 0x1, t_state = 0x1, t_flags = 0x10,
>>>>>> t_lockcnt = 0x1, t_pad = 0x0,
>>>>>> t_name = 0x22378, t_func = 0x90ad, t_arg = 0x0, t_obj = 0x0,
>>>>>> t_sanity_check = {sc_checkin_last = 0x0, sc_checkin_itvl = 0x0, sc_func
>>>> =
>>>>>> 0x0, sc_arg = 0x0, sc_next = {
>>>>>>    sle_next = 0x0}}, t_next_wakeup = 0x0, t_run_time = 0x0,
>>>>>> t_ctx_sw_cnt = 0x213d, t_os_task_list = {stqe_next = 0x0}, t_os_list =
>>>>>> {tqe_next = 0x20001338,
>>>>>>  tqe_prev = 0x200001a8}, t_obj_list = {sle_next = 0x0}}
>>>>> 
>>>>> 
>>>>> Comparing t_prio and mu_prio, this confirms that this task is indeed
>>>>> holding the mutex (no other task is waiting on the mutex).
>>>>> 
>>>>> What can happen that set mu_owner to 0? My original theory was that if a
>>>>> mutex_pend was called from an interrupt context, mu_owner would be 0. But
>>>>> in this case, the only task that is calling mutex is running an eventq,
>>>> so
>>>>> that is unlikely.
>>>>> 
>>>>> Any ideas?
>>>>> 
>>>>> Jitesh
>>>>> 
>>>>> --
>>>>> This email including attachments contains Mad Apparel, Inc. DBA Athos
>>>>> privileged, confidential, and proprietary information solely for the use
>>>>> for the addressed recipients. If you are not the intended recipient,
>>>> please
>>>>> be aware that any review, disclosure, copying, distribution, or use of
>>>> the
>>>>> contents of this message is strictly prohibited. If you have received
>>>> this
>>>>> in error, please delete it immediately and notify the sender. All rights
>>>>> reserved by Mad Apparel, Inc. 2012. The information contained herein is
>>>> the
>>>>> exclusive property of Mad Apparel, Inc. and should not be used,
>>>>> distributed, reproduced, or disclosed in whole or in part without prior
>>>>> written permission of Mad Apparel, Inc.
>>>> 
>>>> 
>>> 
>>> -- 
>>> This email including attachments contains Mad Apparel, Inc. DBA Athos 
>>> privileged, confidential, and proprietary information solely for the use 
>>> for the addressed recipients. If you are not the intended recipient, please 
>>> be aware that any review, disclosure, copying, distribution, or use of the 
>>> contents of this message is strictly prohibited. If you have received this 
>>> in error, please delete it immediately and notify the sender. All rights 
>>> reserved by Mad Apparel, Inc. 2012. The information contained herein is the 
>>> exclusive property of Mad Apparel, Inc. and should not be used, 
>>> distributed, reproduced, or disclosed in whole or in part without prior 
>>> written permission of Mad Apparel, Inc.
> 

Reply via email to