Hi Anton,

It appears your diagnosis is right on target.  The problem seems to be in the 
IpmiSetAutoInsertTimeout() function in plugins/ipmidirect/ipmi.cpp.  At line 
#1227 it calls the IfLeave() method of the cIpmi object, which calls 
ReadUnlock(), without first setting the read lock.

The relevant change from 2.16 seems to be in openhpid/plugin.c.  In the 
oh_create_handler() function, it now calls 
handler->abi->set_autoinsert_timeout(), which it previously did not do.  This 
call maps to the IpmiSetAutoInsertTimeout() function - so it appears that this 
bug was present, but not hit prior to 2.17.

As far as I can tell, there is no reason that the read lock needs to be set in 
this function, as it does not reference any data that is protected by that 
lock, so I just commented out the call to ipmi->IfLeave(), and things now seem 
to work.
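
For anyone following along, here is a minimal sketch of the pattern in question. It is not the actual ipmidirect code: RwLockSketch, ReadGuard, SetAutoInsertTimeoutSketch() and DemoBalanced() are hypothetical names made up for illustration, and I am assuming the plugin's lock is a plain POSIX rwlock. The idea is that a function which touches no protected data neither takes nor releases the lock, while code that does need the lock pairs every lock with exactly one unlock (here through a small RAII guard).

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical stand-in for the plugin's lock wrapper (not the real cIpmi).
class RwLockSketch {
public:
    RwLockSketch()  { pthread_rwlock_init(&m_lock, nullptr); }
    ~RwLockSketch() { pthread_rwlock_destroy(&m_lock); }

    void ReadLock()     { pthread_rwlock_rdlock(&m_lock); }
    void ReadUnlock()   { pthread_rwlock_unlock(&m_lock); }
    // Non-blocking attempt; returns false if the lock is currently held.
    bool TryWriteLock() { return pthread_rwlock_trywrlock(&m_lock) == 0; }
    void WriteUnlock()  { pthread_rwlock_unlock(&m_lock); }

private:
    pthread_rwlock_t m_lock;
};

// RAII guard: the unlock in the destructor always matches the lock taken
// in the constructor, so an unlock without a prior lock cannot happen.
class ReadGuard {
public:
    explicit ReadGuard(RwLockSketch &lock) : m_lock(lock) { m_lock.ReadLock(); }
    ~ReadGuard() { m_lock.ReadUnlock(); }
    ReadGuard(const ReadGuard &) = delete;
    ReadGuard &operator=(const ReadGuard &) = delete;

private:
    RwLockSketch &m_lock;
};

// Shape of the fix: no protected data is referenced, so no lock is taken
// and, just as importantly, none is released.
int SetAutoInsertTimeoutSketch(long long timeout, long long &stored)
{
    stored = timeout;
    return 0;
}

// Checks that the lock is busy while the guard lives and free afterwards.
bool DemoBalanced()
{
    RwLockSketch lock;
    bool busy_while_held = false;
    {
        ReadGuard guard(lock);
        busy_while_held = !lock.TryWriteLock();
    }
    bool free_after = lock.TryWriteLock();
    if (free_after)
        lock.WriteUnlock();
    return busy_while_held && free_after;
}
```

The sketch deliberately never calls ReadUnlock() on a lock that is not held, since that is the operation whose result is undefined.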

If you agree that this looks like a bug, then you should also look at the 
NewSimulatorGetAutoExtractTimeout() function in 
plugins/dynamic_simulator/new_sim.cpp, as it seems to have the same problem.


David


> -----Original Message-----
> From: Anton Pak [mailto:[email protected]]
> Sent: Monday, September 26, 2011 12:33 AM
> To: [email protected]; David McKinley
> Subject: Re: [Openhpi-devel] Hang/Deadlock in 2.17 release
>
> David,
>
> There is daemon-level code in thread #5 working with the handler list.
> There are two locks there: one for the handler list and the other for
> the handler.
> To get access to these locks, one should call one of the oh_xxx()
> functions from the daemon.
> A plug-in usually doesn't work with these locks at all.
>
> As I recall, ipmidirect is quite able to lock itself up without any
> help from other threads.
> It often tries writelock() on a lock that has already been acquired
> with readlock().
> Or it tries to release a lock that was never acquired (I suspect this
> is your case).
> POSIX says these operations lead to UB.
>
>       Anton Pak
>
> On Mon, 26 Sep 2011 06:22:32 +0400, David McKinley
> <[email protected]>
> wrote:
>
> > Anton,
> >
> > First, I pulled down the latest code on the trunk from svn, and
> > checked that the problem still exists - it did.
> >
> > As you suggested, I ran it with gdb, and did back traces for all
> > threads.  This was with the current trunk code.  There does seem to
> > be a deadlock between two threads, though I have not yet been able to
> > track it back to its root.  I ran the test two times, and these two
> > threads were both waiting to get locks in exactly the same places both
> > times, so it seems pretty clear that they are deadlocked.
> >
> > In the attached, the "interesting" threads are #5 and #3.  Thread #3
> > is the one that I can track progress on using the ipmidirect log
> > file, and sure enough, it is blocked waiting on a "write lock" for the
> > domain object immediately after reading the SEL entries.
> >
> > I'm guessing that the reason it cannot get that lock is because it is
> > owned by Thread #5, but I have not been able to verify this.  Thread
> > #5 is waiting on a mutex for the handler.  The simplest case would be
> > if Thread #5 grabbed the domain lock first, then tried to get the
> > handler lock, while Thread #3 held the handler lock and then went for
> > the domain lock.  Whether it is this simple, I don't know - but I
> > would be surprised if the deadlock isn't somehow between these two
> > threads.
> >
> > Anyway, I'm learning a lot, as I look at it.  I'm hoping, though,
> > that you or someone else more familiar with the theory of operation
> > here will see the problem quicker than I am likely to be able to
> > figure it out.
> >
> > Regards,
> >
> > David
> >
> >
> >> -----Original Message-----
> >> From: Anton Pak [mailto:[email protected]]
> >> Sent: Sunday, September 25, 2011 4:11 AM
> >> To: [email protected]; David McKinley
> >> Subject: Re: [Openhpi-devel] Hang/Deadlock in 2.17 release
> >>
> >> I suggest running it under gdb and printing a stack trace for each
> >> thread when it hangs.
> >>
> >>       Anton Pak
> >>
> >> On Sun, 25 Sep 2011 07:43:04 +0400, David McKinley
> >> <[email protected]>
> >> wrote:
> >>
> >> > Hello,
> >> >
> >> > On my platform, which is a Sun Netra, using the ipmidirect plugin,
> >> > things seem to work fine on the 2.12, 2.14, and 2.16 release code,
> >> > but with the 2.17 release code, it hangs during the discovery
> >> > process.  Looking at the log file created by the ipmidirect plugin,
> >> > it proceeds through discovery to the point where it reads the SEL,
> >> > but then never logs anything else, and in particular never logs the
> >> > message, "BMC Discovery Done".  Meanwhile, in clients, calls to
> >> > saHpiDiscover() hang.
> >> >
> >> > Backing out all the code changes in the ipmidirect plugin between
> >> > 2.16 and 2.17 made no difference (there were very few, and
> >> > apparently trivial).  So, the problem seems to have been introduced
> >> > elsewhere.  I looked through the tracker, and did not see any
> >> > problem like this reported.
> >> >
> >> > Given that I'm still very much a newbie in this codebase, I doubt
> >> > that I'll be able to track this down very quickly - and if the
> >> > plugin is working on other platforms, others should judge how much
> >> > importance to attach to this issue.  But, I did want to mention it,
> >> > as it seems like some sort of regression, at least on this platform.
> >> >
> >> > David