Re: [Linux-HA] RA heartbeat/exportfs hangs sporadically

Roman Haefeli Fri, 15 Mar 2013 02:44:57 -0700

On Mon, 2013-03-11 at 16:28 +0100, Dejan Muhamedagic wrote:
> Hi,
> 
> On Mon, Mar 11, 2013 at 10:53:55AM +0100, Roman Haefeli wrote:
> > On Fri, 2013-03-08 at 14:15 +0100, Dejan Muhamedagic wrote:
> > > Hi,
> > > 
> > > On Fri, Mar 08, 2013 at 01:39:27PM +0100, Roman Haefeli wrote:
> > > > On Fri, 2013-03-08 at 13:28 +0100, Roman Haefeli wrote:
> > > > > On Fri, 2013-03-08 at 12:02 +0100, Lars Marowsky-Bree wrote:
> > > > > > On 2013-03-08T11:56:12, Roman Haefeli <[email protected]> wrote:
> > > > > > 
> > > > > > > > Googling "TrackedProcTimeoutFunction exportfs" didn't reveal any
> > > > > > > > results, which makes me think we are alone with this specific problem.
> > > > > > > > Is it the RA that hangs or the command 'exportfs' which is executed by
> > > > > > > > this RA?
> > > 
> > > It is most probably the exportfs program. Unless you hit the
> > > "rmtab growing indefinitely" issue.
> > 
> > No, this is with a later version of the RA.
> > 
> > > > From the log:
> > > > Mar  8 03:10:54 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:stop
> > > > process (PID 5528) timed out (try 2).  Killing with signal SIGKILL (9)
> > > 
> > > This means that the process didn't leave after being sent the
> > > TERM signal. I think that KILL takes place five seconds later.
> > > Was this with the "rmtab problem"?
> > 
> > I still don't fully understand. Is this lrmd trying to kill the RA or
> > the process 'exportfs' with given PID?
> 
> The former. I thought I already answered that.


Yeah, sorry, you did. Just to clarify: you are saying it's most likely
that the 'exportfs' process hangs, and thus lrmd tries to kill the RA,
which will not exit until exportfs exits. Is that correct?
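
To illustrate the sequence seen in the log (TERM first, then KILL once the
process fails to exit), here is a minimal Python sketch. It is not lrmd's
actual code: the function names are made up for illustration, the stand-in
child simply ignores SIGTERM rather than waiting on a hung 'exportfs', and
the five-second default is only the value guessed at earlier in this thread.

```python
import signal
import subprocess
import sys


def spawn_term_ignoring_child():
    """Start a child that ignores SIGTERM (hypothetical stand-in for a hung RA)."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # wait until the child has installed its handler
    return proc


def stop_with_escalation(proc, term_grace=5.0):
    """Send SIGTERM, wait up to term_grace seconds, then fall back to SIGKILL.

    Returns the name of the last signal that was sent.
    The 5 s default is an assumption, not a documented lrmd value.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=term_grace)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()
        return "SIGKILL"


if __name__ == "__main__":
    child = spawn_term_ignoring_child()
    print(stop_with_escalation(child, term_grace=0.5))
    child.stdout.close()
```

Note that the sketch signals only the direct child. Whatever that child has
started itself (here it would be 'exportfs') is not signalled by it.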

> > > > What would be valuable for me to know is what lrmd is trying to kill
> > > > here: the process 'exportfs' or the process of the resource agent?
> > > 
> > > The resource agent instance.
> > > 
> > > > I mean, is 'exportfs' broken on said machine?
> > > 
> > > Name resolution taking long perhaps?
> > 
> > We use IP addresses everywhere, so I assume it's not related to name
> > resolution. 
> > 
> > What can I do about a broken 'exportfs'? It happens so rarely that I
> > haven't had a chance to investigate the problem deeply enough to write a
> > proper bug report.
> 
> Do you run the latest resource-agents (3.9.5)? Then you can
> trace the resource agent, like this:
> 
> primitive r ocf:heartbeat:exportfs \
>       params ... \
>       op stop trace_ra=1
> 
> The trace files will be generated per call in
> $HA_VARLIB/trace_ra/<type>/<id>.<action>.<timestamp>
> 
> HA_VARLIB is usually, I think, /var/lib/heartbeat.

Thanks, that is valuable information. Is it safe to upgrade only
resource-agents while keeping corosync (1.4.2) and pacemaker (1.1.7) at
their current versions?
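
Regarding the trace file location quoted above, the following is a rough
Python sketch for collecting the trace files of one resource and action,
oldest first. The helper name find_trace_files is made up, the fallback to
/var/lib/heartbeat is only the guess given above, and using 'exportfs' as
<type> together with the resource id p_exportfs_virtual from the log is an
assumption about how the placeholders are filled in.

```python
import glob
import os


def find_trace_files(ra_type, resource_id, action, varlib=None):
    """Return trace files matching <varlib>/trace_ra/<type>/<id>.<action>.*

    Files are sorted by modification time, oldest first.
    The default directory is an assumption taken from this thread.
    """
    if varlib is None:
        varlib = os.environ.get("HA_VARLIB", "/var/lib/heartbeat")
    # Escape the fixed part so unusual characters in names are not globbed.
    prefix = os.path.join(varlib, "trace_ra", ra_type, f"{resource_id}.{action}.")
    matches = glob.glob(glob.escape(prefix) + "*")
    return sorted(matches, key=os.path.getmtime)


if __name__ == "__main__":
    for path in find_trace_files("exportfs", "p_exportfs_virtual", "stop"):
        print(path)
```

The last entry of the returned list would then be the trace of the most
recent stop call.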

Thanks,
Roman


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
