Jira (PUP-4415) hung OS processes stall puppet (agent|apply) runs on client

Geoff Williams (JIRA) Wed, 15 Apr 2015 01:40:18 -0700

Title: Message Title

Geoff Williams created an issue

PUP-4415

Issue Type:	Bug
Affects Versions:	PUP 3.7.4
Assignee:	Eric Sorenson
Components:	Platform, UX
Created:	2015/04/15 1:39 AM
Environment:	centos 6.5 with internet access blocked by corporate firewall (but affects all platforms)
Priority:	Normal
Reporter:	Geoff Williams

Overview

When the puppet agent needs to run external processes on a node that can potentially hang, its not obvious to users that execution is waiting for a process to complete

Expected behaviour

Puppet should give feedback with users as to why it is 'stuck' to assure them that it has not crashed or allow them to troubleshoot why the dependent process is blocking

Actual behaviour

Puppet runs can appear to 'stick' for several minutes at certain points under some conditions.

sometimes this is expected and the actions being undertaken are obvious, eg running a packaged installer via an exec would generate load and disk activity on the agent system
Other times, activity can pause for no apparent reason, with no visible load or log messages to indicate what is happening
To all intents and purposes Puppet appears to have 'crashed'

Many users will attempt to rectify this by killing the stuck puppet process and restarting the daemon or will simply reboot their machine. This will cause the Puppet agent to attempt another run in the background immediately when the system is back online which will cause user invocations of puppet agent -t to display the message:

Notice: Run of Puppet configuration client already in progress; skipping  (/var/opt/lib/pe-puppet/state/agent_catalog_run.lock exists)

Due to the original condition still existing, this background puppet run may continue executing for several minutes or users may end up killing it

Problems this causes

Puppet can appear to "stick" or pause for several minutes on every puppet run
Its not obvious why this pausing is occurring in low load situations (although an experience user may use this to deduce they are waiting for network traffic and investigate active network processes)
There is typically no logging information available to indicate what activity is being waited for or why puppet is waiting
Users and experts alike can be stumped by this hours - puppet appears to be broken but its not, its simply blocked by other OS processes

Steps to reproduce

You can emulate a hung process by breaking the yum command on redhat:
1. Break internet access (emulate a corporate firewall)

/sbin/iptables -A OUTPUT -p tcp --destination-port 80 -j DROP

/sbin/iptables -A OUTPUT -p tcp --destination-port 443 -j DROP

2. Clear your yum metadata so that it needs to be downloaded again:

yum clean all

3. Ensure you have at least one yum repository that is configured for internet access:

manually install and enable the EPEL repository

or do this with puppet:

class { 'epel':

  epel_enabled => true,

4. Run any puppet code that does things with yum, eg:

package { "nothere":

  ensure => installed,

You should observe both agent and apply runs of puppet blocking until the yum command returns. This can take about 7 minutes.

diagnosis

You can confirm whether your affected by this issue by running puppet in debug mode (puppet agent -t --debug) - If you see execution 'sticking' after a particular command such as yum, then this is likely the external program you need to investigate and fix.

Alternatively, you could inspect the output of pstree or look for suspicious processes with ps -eaf

Potential solutions

Puppet needs to communicate what it doing back to users without being overly verbose. This will prevent users having the perception that puppet has died when it is simply waiting for a blocking process to complete.

A nice solution here might be to have puppet execute potentially blocking operations in separate process or thread. This would allow puppet to keep an eye on the blocked process and maybe output a message every minute or so, eg:

Info: waited 60 seconds for /usr/bin/yum -d 0 -e 0 -y list nothere to complete

Info: waited 60 seconds for /usr/bin/yum -d 0 -e 0 -y list nothere to complete

Info: waited 60 seconds for /usr/bin/yum -d 0 -e 0 -y list nothere to complete

Info: waited 60 seconds for /usr/bin/yum -d 0 -e 0 -y list nothere to complete

Info: waited 60 seconds for /usr/bin/yum -d 0 -e 0 -y list nothere to complete

Info: waited 60 seconds for /usr/bin/yum -d 0 -e 0 -y list nothere to complete

Such output would be really helpful to sysadmins as it makes real cause of the slowness immediately visible

Applicability

While the incident that inspired this bug report was one related to directly to yum blocking while trying to update its metadata, this suggestion is applicable to all external components that Puppet runs block for.

It would be a really nice usability enhancement if the above could be implemented each time puppet needs to run an external command.

If we do this right, it could potentially aid in troubleshooting when using third party types and providers via the command method as these would be wrapped by the same logic for free.

This could lead to messages such as this (in the case of a long rsync run:

Info: waited 60 seconds for /usr/bin/rsync ...  to complete

Info: waited 60 seconds for /usr/bin/rsync ...  to complete

Info: waited 60 seconds for /usr/bin/rsync ...  to complete

Info: waited 60 seconds for /usr/bin/rsync ...  to complete

Info: waited 60 seconds for /usr/bin/rsync ...  to complete

being displayed which makes it immediately obvious what we are waiting for should this be a problem

Add Comment

This message was sent by Atlassian JIRA

--
You received this message because you are subscribed to the Google Groups "Puppet Bugs" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/puppet-bugs.
For more options, visit https://groups.google.com/d/optout.