David,

Good!
I suggest creating bug tickets.
Keep in mind that this was not the only place with deadlock potential.
Over the years I have encountered many such hidden issues in the
ipmidirect code base.

        Anton Pak

On Mon, 26 Sep 2011 21:47:12 +0400, David McKinley <[email protected]>  
wrote:

> Hi Anton,
>
> It appears your diagnosis is right on target.  The problem seems to be  
> in the IpmiSetAutoInsertTimeout() function in  
> plugins/ipmidirect/ipmi.cpp.  At line #1227 it calls the IfLeave()  
> method of the cIpmi object, which calls ReadUnlock(), without first  
> setting the read lock.
>
> The relevant change from 2.16 seems to be in openhpid/plugin.c.  In the  
> oh_create_handler() function, it now calls  
> handler->abi->set_autoinsert_timeout(), which it previously did not do.   
> This call maps to the IpmiSetAutoInsertTimeout() function - so it  
> appears that this bug was present, but not hit prior to 2.17.
>
> As far as I can tell, there is no reason that the read lock needs to be  
> set in this function, as it does not reference any data that is  
> protected by that lock, so I just commented out the call to  
> ipmi->IfLeave(), and things now seem to work.
>
> If you agree that this looks like a bug, then you should also look at  
> the NewSimulatorGetAutoExtractTimeout() function in  
> plugins/dynamic_simulator/new_sim.cpp, as it seems to have the same  
> problem . . . .
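
[Editor's sketch of the unbalanced-unlock pattern described above. IfEnter/IfLeave are simplified, hypothetical stand-ins for the cIpmi methods, and the balance counter is added purely for illustration; the real code has no such check, which is why the stray IfLeave() goes straight to undefined behavior.]

```cpp
#include <shared_mutex>

// Simplified, illustrative stand-ins for cIpmi::IfEnter()/IfLeave().
// The m_readers counter is not in the real code; it exists here only to
// make the unmatched IfLeave() observable.  In the real plug-in, an
// unmatched ReadUnlock() hits the raw rwlock directly, which POSIX
// leaves as undefined behavior.  (Not thread-safe; illustration only.)
class cIpmiSketch
{
    std::shared_mutex m_lock;
    int               m_readers = 0;

public:
    void IfEnter()
    {
        m_lock.lock_shared();       // ReadLock()
        ++m_readers;
    }

    bool IfLeave()
    {
        if (m_readers == 0)
            return false;           // unmatched unlock: the 2.17 bug path
        --m_readers;
        m_lock.unlock_shared();     // ReadUnlock()
        return true;
    }
};
```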
>
>
> David
>
>
>> -----Original Message-----
>> From: Anton Pak [mailto:[email protected]]
>> Sent: Monday, September 26, 2011 12:33 AM
>> To: [email protected]; David McKinley
>> Subject: Re: [Openhpi-devel] Hang/Deadlock in 2.17 release
>>
>> David,
>>
>> There is daemon-level code in thread #5 that works with the handler
>> list.
>> There are two locks there: one for the handler list and the other for
>> the handler.
>> To get access to these locks, one must call one of the oh_xxx()
>> functions from the daemon.
>> A plug-in usually doesn't work with these locks at all.
>>
>> As I recall, ipmidirect is quite able to deadlock itself without any
>> help from other threads.
>> It often tries writelock() on a lock that has already been acquired
>> with readlock().
>> Or it releases a lock that was never acquired (I suspect that is your
>> case).
>> POSIX says both operations lead to undefined behavior.
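
[Editor's sketch of the self-upgrade problem Anton describes, using std::shared_mutex in place of the plug-in's own rwlock wrapper; the function name is hypothetical. A thread that holds readlock() and then asks for writelock() can never be granted it, since the writer must wait for all readers, including itself. The conflict is observed here from a second thread, which is the defined-behavior way to see it.]

```cpp
#include <shared_mutex>
#include <thread>

// Simulate one thread holding readlock() while a writer tries to get in.
// try_lock() cannot succeed while a shared (read) lock is held, so the
// writer is refused -- and a thread upgrading its own read lock to a
// write lock would simply block on itself forever.
bool writer_blocked_while_read_held()
{
    std::shared_mutex rw;
    rw.lock_shared();               // the plug-in's readlock()

    bool got_write = true;
    std::thread t([&] { got_write = rw.try_lock(); });  // writelock() attempt
    t.join();

    rw.unlock_shared();
    return !got_write;              // true: the writer could not get in
}
```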
>>
>>       Anton Pak
>>
>> On Mon, 26 Sep 2011 06:22:32 +0400, David McKinley
>> <[email protected]>
>> wrote:
>>
>> > Anton,
>> >
>> > First, I pulled down the latest code on the trunk from svn, and
>> checked
>> > that the problem still exists - it did.
>> >
>> > As you suggested, I ran it with gdb, and did back traces for all
>> > threads.  This was with the current trunk code.  There does seem to
>> be a
>> > deadlock between two threads, though I have not yet been able to
>> track
>> > it back to its root.  I ran the test two times, and these two threads
>> > were both waiting to get locks in exactly the same places both times,
>> so
>> > it seems pretty clear that they are deadlocked.
>> >
>> > In the attached, the "interesting" threads are #5 and #3.  Thread #3
>> is
>> > the one that I can track progress on using the ipmidirect log file,
>> and
>> > sure enough, it is blocked waiting on a "write lock" for the domain
>> > object immediately after reading the SEL entries.
>> >
>> > I'm guessing that the reason it cannot get that lock is that it is
>> > owned by Thread #5, but I have not been able to verify this.  Thread
>> > #5 is waiting on a mutex for the handler.  The simplest case would
>> > be if Thread #3 grabbed the handler lock first, then tried to get
>> > the domain lock, while Thread #5 held the domain lock and then went
>> > for the handler lock.  Whether it is this simple, I don't know - but
>> > I would be surprised if the deadlock isn't somehow between these two
>> > threads.
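
[Editor's sketch of the two-lock scenario described above, and the standard way out of it: acquiring both locks together so that nesting order no longer matters. The lock names are hypothetical stand-ins for the daemon's domain and handler locks, not the OpenHPI identifiers.]

```cpp
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the domain lock and the handler lock.
std::mutex domain_lock;
std::mutex handler_lock;
int shared_work = 0;

// Each worker nests the two locks in the opposite order -- the classic
// deadlock recipe.  std::scoped_lock acquires both together using a
// deadlock-avoidance algorithm, so the opposite orders are safe here;
// with plain sequential lock() calls the two threads could each grab
// one lock and wait forever for the other.
void worker_a()
{
    for (int i = 0; i < 1000; ++i) {
        std::scoped_lock guard(domain_lock, handler_lock);
        ++shared_work;
    }
}

void worker_b()
{
    for (int i = 0; i < 1000; ++i) {
        std::scoped_lock guard(handler_lock, domain_lock);
        ++shared_work;
    }
}

int run_both()
{
    shared_work = 0;
    std::thread a(worker_a);
    std::thread b(worker_b);
    a.join();
    b.join();
    return shared_work;
}
```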
>> >
>> > Anyway, I'm learning a lot, as I look at it.  I'm hoping, though,
>> that
>> > you or someone else more familiar with the theory of operation here
>> will
>> > see the problem quicker than I am likely to be able to figure it out.
>> >
>> > Regards,
>> >
>> > David
>> >
>> >
>> >> -----Original Message-----
>> >> From: Anton Pak [mailto:[email protected]]
>> >> Sent: Sunday, September 25, 2011 4:11 AM
>> >> To: [email protected]; David McKinley
>> >> Subject: Re: [Openhpi-devel] Hang/Deadlock in 2.17 release
>> >>
>> >> I suggest running it under gdb and printing a stack trace for each
>> >> thread when it hangs.
>> >>
>> >>       Anton Pak
>> >>
>> >> On Sun, 25 Sep 2011 07:43:04 +0400, David McKinley
>> >> <[email protected]>
>> >> wrote:
>> >>
>> >> > Hello,
>> >> >
>> >> > On my platform, which is a Sun Netra, using the ipmidirect plugin,
>> >> > things seem to work fine on the 2.12, 2.14, and 2.16 release
>> codes,
>> >> but
>> >> > with the 2.17 release code, it hangs during the discovery process.
>> >> > Looking at the log file created by the ipmidirect plugin, it
>> proceeds
>> >> > through discovery to the point where it reads the SEL, but then
>> never
>> >> > logs anything else, and in particular never logs the message, "BMC
>> >> > Discovery Done".  Meanwhile, in clients, calls to saHpiDiscover()
>> >> hang.
>> >> >
>> >> > Backing out all the code changes in the ipmidirect plugin between
>> >> 2.16
>> >> > and 2.17 made no difference (there were very few, and apparently
>> >> > trivial).  So, the problem seems to have been introduced
>> elsewhere.
>> >> I
>> >> > looked through the tracker, and did not see any problem like this
>> >> > reported.
>> >> >
>> >> > Given that I'm still very much a newbie in this codebase, I doubt
>> >> that
>> >> > I'll be able to track this down very quickly - and if the plugin
>> is
>> >> > working in other platforms, others should judge how much
>> importance
>> >> to
>> >> > attach to this issue.  But, I did want to mention it, as it seems
>> >> like
>> >> > some sort of regression, at least on this platform.
>> >> >
>> >> > David
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security
> threats, fraudulent activity and more. Splunk takes this data and makes
> sense of it. Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2dcopy1
> _______________________________________________
> Openhpi-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/openhpi-devel
