I’d start by looking for memory corruption. You could try adding guard variables around your send_mutex and see if anything stomps on them. Another option would be to change os_mutex_release() to write something other than 0 to mu_owner, and then add a conditional hardware watchpoint that fires if anything zeroes out mu_owner of your send_mutex.
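Here is a rough sketch of the guard-variable idea. Everything in it is made up for illustration: `struct fake_mutex` is only a stand-in for the real `struct os_mutex` (field names taken from the gdb dump in this thread, types guessed), and `struct guarded_mutex`, `GUARD_PATTERN` and the `guarded_mutex_*` helpers are hypothetical names, not part of the OS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct os_mutex. Field names follow the gdb dump in this
 * thread; the exact types and layout are assumptions. */
struct fake_mutex {
    void *mu_head;
    uint8_t _pad;
    uint8_t mu_prio;
    uint16_t mu_level;
    void *mu_owner;
};

/* Arbitrary, hypothetical pattern; any unlikely value works. */
#define GUARD_PATTERN 0xdeadbeefu

/* Hypothetical wrapper: one guard word directly before and one directly
 * after the mutex, so a stray write from a neighbour hits a guard. */
struct guarded_mutex {
    volatile uint32_t guard_before;
    struct fake_mutex mu;
    volatile uint32_t guard_after;
};

static void guarded_mutex_init(struct guarded_mutex *g)
{
    g->guard_before = GUARD_PATTERN;
    g->mu = (struct fake_mutex){0};
    g->guard_after = GUARD_PATTERN;
}

/* Returns true if both guards are intact and the mutex is not in the
 * "held but ownerless" state (nonzero level, NULL owner) seen in the dump. */
static bool guarded_mutex_ok(const struct guarded_mutex *g)
{
    bool guards_ok = g->guard_before == GUARD_PATTERN &&
                     g->guard_after == GUARD_PATTERN;
    bool owner_ok = !(g->mu.mu_level != 0 && g->mu.mu_owner == NULL);

    return guards_ok && owner_ok;
}

/* Call this around every pend/release so a failure is caught close to
 * the write that caused it. */
static void guarded_mutex_check(const struct guarded_mutex *g)
{
    assert(guarded_mutex_ok(g));
}
```

You would call `guarded_mutex_check()` right before each pend and right after each release. For the watchpoint variant, once release no longer writes 0 to mu_owner, something like `watch send_mutex.mu_owner if send_mutex.mu_owner == 0` in gdb should stop on the stray write (assuming your target has a free hardware watchpoint).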
Good luck with the hunt.

> On Nov 6, 2017, at 3:39 PM, will sanfilippo <[email protected]> wrote:
>
> Yeah, Chris is right here. I did not read the email thoroughly enough and if
> what I described happened, the owner would not be NULL. Sorry about that.
>
> So while it would explain lockcnt and level, it would not explain why the
> owner is NULL, as failing to release the mutex would have the owner set to
> something.
>
>> On Nov 6, 2017, at 3:33 PM, Christopher Collins <[email protected]> wrote:
>>
>> I agree that a mutex should never have a null owner and a nonzero level.
>>
>> Unfortunately, my first guess is some form of memory corruption:
>> it seems like a null value accidentally got written to `mu_owner`. I
>> could be missing it, but I don't see any logic error in the mutex code
>> which could cause this.
>>
>> Getting to the bottom of this is probably going to be difficult,
>> especially if it is not easy to reproduce. I don't know how valuable
>> they are, but my two suggestions are:
>>
>> 1. Look at the `.lst` file that newt generates during a build to
>> determine what object immediately follows the mutex in RAM. Maybe an
>> errant write intended for this object is clearing the owner field.
>>
>> 2. Instrument the code with a bunch of asserts and logs. Maybe you can
>> catch the problem shortly after it happens.
>>
>> Like I said, probably not the most helpful advice, but I don't think
>> this is going to be an easy one to solve!
>>
>> Chris
>>
>> On Mon, Nov 06, 2017 at 03:16:06PM -0800, Jitesh Shah wrote:
>>> Hey wil,
>>> Are you saying that because "mu_level" is set to 1?
>>>
>>> It is set to 1 because the last call to os_mutex_release() failed on
>>> account of "mu_owner" not matching. Thus, the task that got the mutex
>>> failed to release it. That explains t_lockcnt and mu_level, right?
>>>
>>> Jitesh
>>>
>>> On Mon, Nov 6, 2017 at 7:56 AM, will sanfilippo <[email protected]> wrote:
>>>
>>>> What this looks like to me is that there was a nested pend without the
>>>> same number of releases. Maybe some path out of some code that is rarely
>>>> hit where a mutex is granted but not released?
>>>>
>>>> Just a guess...
>>>>
>>>>> On Nov 5, 2017, at 8:26 PM, Jitesh Shah <[email protected]> wrote:
>>>>>
>>>>> Hey Guys,
>>>>> I am running v1.0.0 branch (0db6321a75deda126943aa187842da6b977cd1c1).
>>>>> Seeing some strange mutex behaviour.
>>>>>
>>>>> So once in a bazillion times, a mutex fails to release. Here is how the
>>>>> structure looks like when it fails:
>>>>>
>>>>>> (gdb) p/x send_mutex
>>>>>> $1 = {mu_head = {slh_first = 0x0}, _pad = 0x0, mu_prio = 0x1, mu_level =
>>>>>> 0x1, mu_owner = 0x0}
>>>>>
>>>>> Why is mu_owner set to 0? That causes the os_mutex_release call to fail
>>>>> since the current task doesn't match the owner task anymore.
>>>>>
>>>>> The task which holds the mutex looks like this:
>>>>>
>>>>>> (gdb) p/x cent_task
>>>>>> $3 = {t_stackptr = 0x20008a28, t_stacktop = 0x20008ac8, t_stacksize =
>>>>>> 0x80, t_taskid = 0x6, t_prio = 0x1, t_state = 0x1, t_flags = 0x10,
>>>>>> t_lockcnt = 0x1, t_pad = 0x0,
>>>>>> t_name = 0x22378, t_func = 0x90ad, t_arg = 0x0, t_obj = 0x0,
>>>>>> t_sanity_check = {sc_checkin_last = 0x0, sc_checkin_itvl = 0x0, sc_func =
>>>>>> 0x0, sc_arg = 0x0, sc_next = {
>>>>>> sle_next = 0x0}}, t_next_wakeup = 0x0, t_run_time = 0x0,
>>>>>> t_ctx_sw_cnt = 0x213d, t_os_task_list = {stqe_next = 0x0}, t_os_list =
>>>>>> {tqe_next = 0x20001338,
>>>>>> tqe_prev = 0x200001a8}, t_obj_list = {sle_next = 0x0}}
>>>>>
>>>>> Comparing t_prio and mu_prio, this confirms that this task is indeed
>>>>> holding the mutex (no other task is waiting on the mutex).
>>>>>
>>>>> What can happen that set mu_owner to 0? My original theory was that if a
>>>>> mutex_pend was called from an interrupt context, mu_owner would be 0. But
>>>>> in this case, the only task that is calling mutex is running an eventq, so
>>>>> that is unlikely.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> Jitesh
>>>>>
>>>>> --
>>>>> This email including attachments contains Mad Apparel, Inc. DBA Athos
>>>>> privileged, confidential, and proprietary information solely for the use
>>>>> for the addressed recipients. If you are not the intended recipient, please
>>>>> be aware that any review, disclosure, copying, distribution, or use of the
>>>>> contents of this message is strictly prohibited. If you have received this
>>>>> in error, please delete it immediately and notify the sender. All rights
>>>>> reserved by Mad Apparel, Inc. 2012. The information contained herein is the
>>>>> exclusive property of Mad Apparel, Inc. and should not be used,
>>>>> distributed, reproduced, or disclosed in whole or in part without prior
>>>>> written permission of Mad Apparel, Inc.
