On Mon, 1 Jan 2018, at 23:51, Matt Wise wrote:
> *Puppet Agent: 5.3.2*
> *Puppet Server: 5.1.4 - Packaged in Docker, running on Amazon ECS*
>
> So we've recently started rolling over from our ancient Puppet 3.x system
> to a new Puppet 5.x service. The new service consists of a PuppetServer
> Docker Image (5.1.4) running in Amazon ECS, and our hosts booting up and
> running Puppet Agent 5.3.2. At this point in the migration, we're running
> ~150-200 hosts on the new Puppet5 system and we replace ~30-80 of them
> daily.
>
> We are currently tracking down a problem with our PuppetServers and their
> memory usage, which is causing the containers to be OOM'd a few times a day
> (~10 OOMs a day across ~20 containers). While we know that we need to fix
> this, we've seen a scary behavior on the Puppet Agent side that we could
> use some advice with.
>
> It seems that at least a few times a day now we will get a server hung in
> the boot process. The `puppet agent -t ...` process will just hang midway
> through the run. It seems that these hangs happen when the backend
> underlying PuppetServer process that they were connected to gets OOMed and
> goes away. Obviously the OOM is a problem.. but frankly I am more concerned
> with the Puppet Agent getting wedged for hours and hours without making any
> progress.
>
> It seems that when this failure happens, the puppet agent does not ever
> time out. It never fails, or throws an error. It just hangs. We've had
> these hangs last upwards of 4-5 hours before our systems are automatically
> terminated.
>
> We've enabled debug logging, but haven't caught one of these failures yet
> with debug mode turned on. In the mean time, are there any known
> regressions or configuration tweaks we need to make to Puppet Agent 5.x
> more quick to fail or resilient in this case? I could obviously try to
> build in some wrapper around Puppet to catch this behavior .. but I am
> hoping that there are just some settings we need to tweak.

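On the wrapper idea: a minimal sketch of what that could look like, assuming a plain Python 3 script is acceptable on the hosts. The function name `run_with_timeout`, the 30-minute limit in the usage comment and the exit code 124 (borrowed from the GNU `timeout` convention) are illustrative choices, not anything Puppet defines. It only bounds the wall-clock time of the run; it does not explain or fix the hang itself.

```python
import subprocess

# Hypothetical exit code used to signal "the run was killed for taking too long".
TIMED_OUT_EXIT_CODE = 124


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return its exit code, or TIMED_OUT_EXIT_CODE on timeout.

    subprocess.run() kills the direct child and waits for it when the
    timeout expires. Processes spawned by that child are not tracked here.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return TIMED_OUT_EXIT_CODE
    return completed.returncode


# Example use from a boot script (assumed 30-minute limit):
#   import sys
#   sys.exit(run_with_timeout(["puppet", "agent", "-t"], 30 * 60))
```

The boot script can then treat the timeout exit code as a failed run and retry or terminate the host, rather than waiting hours for the agent to come back.
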
I see this often with other kinds of interruptions too, such as network interruptions. I do recall a number of bugs being filed around this to make the agent more robust; you might want to try searching the Puppet JIRA.

--
R.I.Pienaar / www.devco.net / @ripienaar
