> On Wed, 16 Sep 2015, Zhu Jefferry wrote:
> > The application is a multi-threaded program that uses pairs of
> > mutex_lock and mutex_unlock to protect a shared data structure. The
> > type of this mutex is PTHREAD_MUTEX_PI_RECURSIVE_NP. After running
> > for a long time, say several days, the mutex_lock data structure in
> > user space looks corrupted:
> >
> >    thread 0 can do mutex_lock/unlock
> >    __lock = this thread | FUTEX_WAITERS
> >    __owner = 0, should be this thread
> 
> The kernel does not know about __owner.

Correct. It shows the last failure is in mutex_unlock,
which clears __owner in user space.
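For context, these are the user-space bookkeeping fields involved (a
simplified sketch of glibc's struct __pthread_mutex_s, i.e.
pthread_mutex_t.__data; only __lock is the futex word the kernel acts on,
__count and __owner are maintained purely by libc):

    /* Simplified view of the relevant fields in glibc's
     * struct __pthread_mutex_s (pthread_mutex_t.__data). */
    struct __pthread_mutex_s_view {
        int          __lock;   /* futex word: owner TID | FUTEX_WAITERS;
                                  the only field the kernel knows about */
        unsigned int __count;  /* recursion depth, libc bookkeeping only */
        int          __owner;  /* owner TID cached by libc, cleared on
                                  unlock; the kernel never sees it */
        /* ... remaining fields omitted ... */
    };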

> 
> >    __counter keeps increasing, although there is no recursive
> >    mutex_lock call.
> >
> >    thread 1 will be stuck
> > 
> > The initial debugging shows that the content of __lock goes wrong first.
> > After a call to mutex_unlock, the value of __lock should no longer be
> > this thread's TID. But we observed that __lock still holds this
> > thread's TID after unlock, so other threads will be stuck.
> 
> How did you observe that?

I added an assert in mutex_unlock, after it finishes modifying __lock
(either in user space or in the kernel), just before it returns.
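Roughly, the check looks like the sketch below (the real assert lives
inside our instrumented glibc unlock path; the FUTEX_TID_MASK and
SYS_gettid usage here is just my illustration of the same condition):

    #include <assert.h>
    #include <linux/futex.h>     /* FUTEX_WAITERS, FUTEX_TID_MASK */
    #include <pthread.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Illustrative check placed right after the unlock path has updated
     * __lock (either via the user-space fast path or FUTEX_UNLOCK_PI):
     * the futex word must no longer carry our own TID, and __owner must
     * already be cleared. In the failure, the first assert fires. */
    static void check_after_unlock(pthread_mutex_t *m)
    {
        pid_t tid = syscall(SYS_gettid);

        assert((m->__data.__lock & FUTEX_TID_MASK) != tid);
        assert(m->__data.__owner == 0);
    }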

> 
> > This thread can still lock, due to the recursive type, and __counter
> > keeps increasing. mutex_unlock returns a failure because of the wrong
> > value of __owner, but the application did not check the return value.
> > So thread 0 looks fine, but thread 1 will be stuck forever.
> 
> Oh well. So thread 0 looks all fine, despite not checking return values.
> 

Correct.
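To spell out why thread 0 looks fine: in this corrupted state the futex
word still carries thread 0's TID, so each lock call is treated as a
recursive re-acquire and __count grows, while each unlock fails
(presumably with EPERM, since __owner no longer matches). If the
application ignores that return code, nothing visible goes wrong in
thread 0. A minimal sketch of the application-side pattern (hypothetical
code, not our customer's source):

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical pattern: the unlock return value is ignored, so a
     * persistent EPERM from pthread_mutex_unlock() goes unnoticed and
     * thread 0 keeps running "normally" while __count keeps growing. */
    void critical_section_unchecked(pthread_mutex_t *m)
    {
        pthread_mutex_lock(m);    /* succeeds: looks like a recursive re-lock */
        /* ... touch shared data ... */
        pthread_mutex_unlock(m);  /* fails, but nobody looks */
    }

    /* Checking the return code would have exposed the corruption at once. */
    void critical_section_checked(pthread_mutex_t *m)
    {
        pthread_mutex_lock(m);
        /* ... touch shared data ... */
        int err = pthread_mutex_unlock(m);
        if (err != 0)
            fprintf(stderr, "pthread_mutex_unlock failed: %d\n", err);
    }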

Actually, I'm not clear about how the futex state changes inside the kernel.
I searched the Internet and found a similar failure reported by another user.
He was using kernel 2.6.38; our customer is using kernel 2.6.34 (Wind River
Linux 4.1).

    ====
    http://www.programdoc.com/1272_157986_1.htm

    Maybe there is a bug in pi-futex that can make the program hang in
    user space.
    We have a board with a two-core PowerPC 8572 CPU. After running for
    one month, the state of the pi-futex in user space went bad:
    mutex->__data.__lock is 0x8000023e,
    mutex->__data.__count is 0,
    mutex->__data.__owner is 0.

I cannot fully understand the failure case he described, but I think it might
be helpful for you in analyzing this corner case.
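For what it's worth, his __lock value decodes to the same pattern we see:
FUTEX_WAITERS set plus a stale owner TID, while __owner is 0. A quick
decode using the futex word layout from <linux/futex.h> (the TID 0x23e is
of course specific to his trace):

    #include <linux/futex.h>  /* FUTEX_WAITERS, FUTEX_OWNER_DIED, FUTEX_TID_MASK */
    #include <stdio.h>

    int main(void)
    {
        unsigned int lock = 0x8000023e;   /* __lock value from the report above */

        printf("waiters    : %s\n", (lock & FUTEX_WAITERS) ? "yes" : "no");
        printf("owner died : %s\n", (lock & FUTEX_OWNER_DIED) ? "yes" : "no");
        printf("owner tid  : %u\n", lock & FUTEX_TID_MASK);  /* 0x23e = 574 */
        return 0;
    }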
