Anton,

First, I pulled down the latest trunk code from svn and confirmed that the 
problem still exists - it does.

As you suggested, I ran it under gdb and captured backtraces for all threads 
(again with the current trunk code).  There does seem to be a deadlock between 
two threads, though I have not yet been able to track it back to its root.  I 
ran the test twice, and both times these two threads were waiting to acquire 
locks in exactly the same places, so it seems pretty clear that they are 
deadlocked.
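
One thing that might help confirm who actually holds these locks: on glibc, 
the pthread mutex structure records the holder's LWP id, so it can sometimes 
be read straight out of the gdb session.  A sketch (the mutex address is 
hypothetical - it would have to be recovered from the blocked frame's 
arguments):

```
(gdb) thread 5
(gdb) frame 3                           # the pthread_mutex_lock frame
(gdb) print ((pthread_mutex_t *) <mutex-address>)->__data.__owner
```

If that printed 16729 (Thread #3's LWP), the cycle would be confirmed.  I 
believe the glibc rwlock that Thread #3 is blocked on similarly records its 
writer in a `__writer` field, though that is glibc-internal and 
version-dependent.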

In the attached backtraces, the "interesting" threads are #5 and #3.  Thread 
#3 is the one whose progress I can track in the ipmidirect log file, and sure 
enough, it is blocked waiting on a "write lock" for the domain object 
immediately after reading the SEL entries.

I'm guessing that the reason it cannot get that lock is because it is held by 
Thread #5, but I have not been able to verify this.  Thread #5, in turn, is 
waiting on a mutex for the handler.  The simplest case would be if Thread #5 
grabbed the domain lock first, then tried to get the handler lock, while 
Thread #3 held the handler lock and then went for the domain lock.  Whether it 
is this simple, I don't know - but I would be surprised if the deadlock isn't 
somehow between these two threads.

Anyway, I'm learning a lot as I look at this.  I'm hoping, though, that you or 
someone else more familiar with the theory of operation here will spot the 
problem more quickly than I am likely to.

Regards,

David


> -----Original Message-----
> From: Anton Pak [mailto:[email protected]]
> Sent: Sunday, September 25, 2011 4:11 AM
> To: [email protected]; David McKinley
> Subject: Re: [Openhpi-devel] Hang/Deadlock in 2.17 release
>
> Suggest to run it under gdb and print stack trace for each thread when
> it
> hangs.
>
>       Anton Pak
>
> On Sun, 25 Sep 2011 07:43:04 +0400, David McKinley
> <[email protected]> wrote:
>
> > Hello,
> >
> > On my platform, which is a Sun Netra, using the ipmidirect plugin,
> > things seem to work fine on the 2.12, 2.14, and 2.16 release codes,
> > but with the 2.17 release code, it hangs during the discovery process.
> > Looking at the log file created by the ipmidirect plugin, it proceeds
> > through discovery to the point where it reads the SEL, but then never
> > logs anything else, and in particular never logs the message, "BMC
> > Discovery Done".  Meanwhile, in clients, calls to saHpiDiscover() hang.
> >
> > Backing out all the code changes in the ipmidirect plugin between 2.16
> > and 2.17 made no difference (there were very few, and apparently
> > trivial).  So, the problem seems to have been introduced elsewhere.  I
> > looked through the tracker, and did not see any problem like this
> > reported.
> >
> > Given that I'm still very much a newbie in this codebase, I doubt that
> > I'll be able to track this down very quickly - and if the plugin is
> > working in other platforms, others should judge how much importance to
> > attach to this issue.  But, I did want to mention it, as it seems like
> > some sort of regression, at least on this platform.
> >
> > David
(gdb) thread apply all backtrace

Thread 6 (Thread 0xb57e4b90 (LWP 16732)):
#0  0x00152402 in __kernel_vsyscall ()
#1  0x0084bbc5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x00161582 in ?? () from /lib/libglib-2.0.so.0
#3  0x00161925 in g_async_queue_pop () from /lib/libglib-2.0.so.0
#4  0x0805ae49 in oh_process_events () at event.c:529
#5  0x08070882 in evtpop_func (data=0x0) at threaded.c:105
#6  0x0019bf1f in ?? () from /lib/libglib-2.0.so.0
#7  0x00847832 in start_thread () from /lib/libpthread.so.0
#8  0x007870ae in clone () from /lib/libc.so.6

Thread 5 (Thread 0xb61e5b90 (LWP 16731)):
#0  0x00152402 in __kernel_vsyscall ()
#1  0x0084e6e9 in __lll_lock_wait () from /lib/libpthread.so.0
#2  0x00849d9f in _L_lock_885 () from /lib/libpthread.so.0
#3  0x00849c66 in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x0019c6e0 in g_static_rec_mutex_lock () from /lib/libglib-2.0.so.0
#5  0x0805d0d5 in oh_get_handler (hid=1) at plugin.c:406
#6  0x0805a056 in oh_harvest_events () at event.c:198
#7  0x08070702 in evtget_func (data=0x0) at threaded.c:81
#8  0x0019bf1f in ?? () from /lib/libglib-2.0.so.0
#9  0x00847832 in start_thread () from /lib/libpthread.so.0
#10 0x007870ae in clone () from /lib/libc.so.6

Thread 4 (Thread 0xb6be6b90 (LWP 16730)):
#0  0x00152402 in __kernel_vsyscall ()
#1  0x00746d26 in nanosleep () from /lib/libc.so.6
#2  0x0078050c in usleep () from /lib/libc.so.6
#3  0x0020b2fe in cIpmi::IfDiscoverResources (this=0x808da00) at ipmi.cpp:2271
#4  0x0020e783 in IpmiDiscoverResources (hnd=0x808b310) at ipmi.cpp:475
#5  0x0805daa5 in oh_discovery () at plugin.c:664
#6  0x080704db in discovery_func (data=0x0) at threaded.c:48
#7  0x0019bf1f in ?? () from /lib/libglib-2.0.so.0
#8  0x00847832 in start_thread () from /lib/libpthread.so.0
#9  0x007870ae in clone () from /lib/libc.so.6

Thread 3 (Thread 0xb75e7b90 (LWP 16729)):
#0  0x00152402 in __kernel_vsyscall ()
#1  0x0084b31b in pthread_rwlock_wrlock () from /lib/libpthread.so.0
#2  0x0023f8ae in cThreadLockRw::WriteLock (this=0x808e764) at thread.cpp:269
#3  0x0021a07a in cIpmiDomain::WriteLock (this=0x808da00) at ipmi_domain.h:172
#4  0x00217bdc in cIpmiMcThread::WriteLock (this=0x80a3798) at ipmi_discover.cpp:70
#5  0x00218c00 in cIpmiMcThread::Discover (this=0x80a3798, get_device_id_rsp=0xb75e72a8) at ipmi_discover.cpp:380
#6  0x0021986c in cIpmiMcThread::Run (this=0x80a3798) at ipmi_discover.cpp:192
#7  0x0023fd67 in cThread::Thread (param=0x80a3798) at thread.cpp:108
#8  0x00847832 in start_thread () from /lib/libpthread.so.0
#9  0x007870ae in clone () from /lib/libc.so.6

Thread 2 (Thread 0xb7fe8b90 (LWP 16722)):
#0  0x00152402 in __kernel_vsyscall ()
#1  0x0077d3d3 in poll () from /lib/libc.so.6
#2  0x00211209 in cIpmiCon::Run (this=0x808f940) at ipmi_con.cpp:270
#3  0x0023fd67 in cThread::Thread (param=0x808f940) at thread.cpp:108
#4  0x00847832 in start_thread () from /lib/libpthread.so.0
#5  0x007870ae in clone () from /lib/libc.so.6

Thread 1 (Thread 0xb7fe96e0 (LWP 16719)):
#0  0x00152402 in __kernel_vsyscall ()
#1  0x0084ed18 in accept () from /lib/libpthread.so.0
#2  0x00e903aa in cServerStreamSock::Accept (this=0x808d4a0) at strmsock.cpp:512
#3  0x0804de7f in oh_server_run (ipvflags=1, bindaddr=0x0, port=4743, sock_timeout=0, max_threads=-1) at server.cpp:163
#4  0x08059b5e in main (argc=Cannot access memory at address 0x0) at openhpid-posix.cpp:427
(gdb)