On Fri, Mar 15, 2013 at 10:44:37AM +0100, Roman Haefeli wrote:
> On Mon, 2013-03-11 at 16:28 +0100, Dejan Muhamedagic wrote:
> > Hi,
> >
> > On Mon, Mar 11, 2013 at 10:53:55AM +0100, Roman Haefeli wrote:
> > > On Fri, 2013-03-08 at 14:15 +0100, Dejan Muhamedagic wrote:
> > > > Hi,
> > > >
> > > > On Fri, Mar 08, 2013 at 01:39:27PM +0100, Roman Haefeli wrote:
> > > > > On Fri, 2013-03-08 at 13:28 +0100, Roman Haefeli wrote:
> > > > > > On Fri, 2013-03-08 at 12:02 +0100, Lars Marowsky-Bree wrote:
> > > > > > > On 2013-03-08T11:56:12, Roman Haefeli <[email protected]> wrote:
> > > > > > >
> > > > > > > > Googling "TrackedProcTimeoutFunction exportfs" didn't reveal any
> > > > > > > > results, which makes me think we are alone with this specific
> > > > > > > > problem.
> > > > > > > > Is it the RA that hangs or the command 'exportfs' which is
> > > > > > > > executed by this RA?
> > > >
> > > > It is most probably the exportfs program. Unless you hit the
> > > > "rmtab growing indefinitely" issue.
> > >
> > > No, this is with a later version of the RA.
> > >
> > > > > From the log:
> > > > > Mar 8 03:10:54 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:stop
> > > > > process (PID 5528) timed out (try 2). Killing with signal SIGKILL (9)
> > > >
> > > > This means that the process didn't leave after being sent the
> > > > TERM signal. I think that KILL takes place five seconds later.
> > > > Was this with the "rmtab problem"?
> > >
> > > I still don't fully understand. Is this lrmd trying to kill the RA or
> > > the process 'exportfs' with given PID?
> >
> > The former. I thought I already answered that.
>
> Yeah, sorry you did. Just for clarification: You say it's most likely
> that the 'exportfs' process hangs and thus lrmd tries to kill the RA,
> which will not exit until exportfs exits, is that correct?
Right.

> > > > > For me valuable to know is what is lrmd trying to kill here: the
> > > > > process 'exportfs' or the process of the resource agent?
> > > >
> > > > The resource agent instance.
> > > >
> > > > > I mean, is 'exportfs' broken on said machine?
> > > >
> > > > Name resolution taking long perhaps?
> > >
> > > We use IP addresses everywhere, so I assume it's not related to name
> > > resolution.
> > >
> > > What can I do about a broken 'exportfs'? It happens so seldom that I
> > > don't have a chance to deeply investigate the problem to write a proper
> > > bug report.
> >
> > Do you run the latest resource-agents (3.9.5)? Then you can
> > trace the resource agent, like this:
> >
> > primitive r ocf:heartbeat:exportfs \
> >     params ... \
> >     op stop trace_ra=1
> >
> > The trace files will be generated per call in
> > $HA_VARLIB/trace_ra/<type>/<id>.<action>.<timestamp>
> >
> > HA_VARLIB is usually, I think, /var/lib/heartbeat.
>
> Thanks, that is valuable information. Is it safe to only upgrade the
> resource-agents while keeping corosync (1.4.2) and pacemaker (1.1.7) at
> their current version?

Yes, you can update them independently.

Thanks,

Dejan

> Thanks,
> Roman
>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
