Overview
When the puppet agent needs to run external processes on a node that can potentially hang, its not obvious to users that execution is waiting for a process to complete
Expected behaviour
Puppet should give feedback with users as to why it is 'stuck' to assure them that it has not crashed or allow them to troubleshoot why the dependent process is blocking
Actual behaviour
Puppet runs can appear to 'stick' for several minutes at certain points under some conditions.
-
sometimes this is expected and the actions being undertaken are obvious, eg running a packaged installer via an exec would generate load and disk activity on the agent system
-
Other times, activity can pause for no apparent reason, with no visible load or log messages to indicate what is happening
-
To all intents and purposes Puppet appears to have 'crashed'
-
Many users will attempt to rectify this by killing the stuck puppet process and restarting the daemon or will simply reboot their machine. This will cause the Puppet agent to attempt another run in the background immediately when the system is back online which will cause user invocations of puppet agent -t to display the message:
Notice: Run of Puppet configuration client already in progress; skipping (/var/opt/lib/pe-puppet/state/agent_catalog_run.lock exists)
|
-
Due to the original condition still existing, this background puppet run may continue executing for several minutes or users may end up killing it
Problems this causes
-
Puppet can appear to "stick" or pause for several minutes on every puppet run
-
Its not obvious why this pausing is occurring in low load situations (although an experience user may use this to deduce they are waiting for network traffic and investigate active network processes)
-
There is typically no logging information available to indicate what activity is being waited for or why puppet is waiting
-
Users and experts alike can be stumped by this hours - puppet appears to be broken but its not, its simply blocked by other OS processes
Steps to reproduce
You can emulate a hung process by breaking the yum command on redhat: 1. Break internet access (emulate a corporate firewall)
/sbin/iptables -A OUTPUT -p tcp --destination-port 80 -j DROP
|
/sbin/iptables -A OUTPUT -p tcp --destination-port 443 -j DROP
|
2. Clear your yum metadata so that it needs to be downloaded again:
3. Ensure you have at least one yum repository that is configured for internet access:
-
manually install and enable the EPEL repository
-
or do this with puppet:
class { 'epel':
|
epel_enabled => true,
|
}
|
4. Run any puppet code that does things with yum, eg:
package { "nothere":
|
ensure => installed,
|
}
|
You should observe both agent and apply runs of puppet blocking until the yum command returns. This can take about 7 minutes.
diagnosis
You can confirm whether your affected by this issue by running puppet in debug mode (puppet agent -t --debug) - If you see execution 'sticking' after a particular command such as yum, then this is likely the external program you need to investigate and fix.
Alternatively, you could inspect the output of pstree or look for suspicious processes with ps -eaf
Potential solutions
Puppet needs to communicate what it doing back to users without being overly verbose. This will prevent users having the perception that puppet has died when it is simply waiting for a blocking process to complete.
A nice solution here might be to have puppet execute potentially blocking operations in separate process or thread. This would allow puppet to keep an eye on the blocked process and maybe output a message every minute or so, eg:
Info: waited 60 seconds for /usr/bin/yum -d 0 -e 0 -y list nothere to complete
|
Info: waited 60 seconds for /usr/bin/yum -d 0 -e 0 -y list nothere to complete
|
Info: waited 60 seconds for /usr/bin/yum -d 0 -e 0 -y list nothere to complete
|
Info: waited 60 seconds for /usr/bin/yum -d 0 -e 0 -y list nothere to complete
|
Info: waited 60 seconds for /usr/bin/yum -d 0 -e 0 -y list nothere to complete
|
Info: waited 60 seconds for /usr/bin/yum -d 0 -e 0 -y list nothere to complete
|
Such output would be really helpful to sysadmins as it makes real cause of the slowness immediately visible
Applicability
While the incident that inspired this bug report was one related to directly to yum blocking while trying to update its metadata, this suggestion is applicable to all external components that Puppet runs block for.
It would be a really nice usability enhancement if the above could be implemented each time puppet needs to run an external command.
If we do this right, it could potentially aid in troubleshooting when using third party types and providers via the command method as these would be wrapped by the same logic for free.
This could lead to messages such as this (in the case of a long rsync run:
Info: waited 60 seconds for /usr/bin/rsync ... to complete
|
Info: waited 60 seconds for /usr/bin/rsync ... to complete
|
Info: waited 60 seconds for /usr/bin/rsync ... to complete
|
Info: waited 60 seconds for /usr/bin/rsync ... to complete
|
Info: waited 60 seconds for /usr/bin/rsync ... to complete
|
being displayed which makes it immediately obvious what we are waiting for should this be a problem
|