I'm not 100% sure if the subject correctly describes the problem I've
been having, but it's the closest I can get with my troubleshooting.
My setup looks like this:

* 2 puppetmasters running 0.25.4 on Ubuntu, both under Passenger
* backend content (etc and var) shared over NFS
* HAProxy load balancing across the 2 puppetmasters
* MySQL for stored configs

I upgraded from 0.24.8 to 0.25.4 a couple of weeks ago.  The setup
described above has worked fine since we implemented it months ago,
so I don't believe that there is any problem with NFS or
the load balancer.  I have a handful of custom functions, and after
updating to 0.25.4, puppetmaster started complaining about one of
them, a simple function called nagios_name.  This function takes an
FQDN and turns it into a name we use in Nagios and mcollective
(turning "support.arces.net" into "arces.support" for example).  The
function is basic Ruby and is available for you to look at here:
http://monachus.pastebin.com/yLF1syqU.  The function works fine.
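
For anyone who can't reach the pastebin, here is a rough plain-Ruby
sketch of the kind of transformation described above.  It is not the
pasted code: the rule it uses (drop the last label, reverse the rest)
is only an assumption inferred from the single "support.arces.net"
example, and the fallback for short names is made up for illustration.
The real function is registered with the puppetmaster through
Puppet::Parser::Functions.newfunction rather than defined as a bare
method.

```ruby
# Hypothetical reconstruction, not the function from the pastebin.
# Assumed rule: drop the top-level domain and reverse the remaining
# labels, so "support.arces.net" becomes "arces.support".
def nagios_name(fqdn)
  labels = fqdn.to_s.downcase.split('.')

  # Assumption: names with fewer than three labels are returned as-is,
  # since the post does not say how they should be handled.
  return fqdn.to_s if labels.length < 3

  # Everything except the last label (the TLD), in reverse order.
  labels[0..-2].reverse.join('.')
end

puts nagios_name("support.arces.net")
```

Since this sketch only does string handling on its argument, a version
wrapped in newfunction would behave the same way once the puppetmaster
has actually loaded the file.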

The error that puppetmaster reports is:

Unknown function nagios_name at /var/www/localhost/puppet/etc/manifests/outsidein_nodes.pp:16 on node some.node.com.

It doesn't report this all of the time - it reports it about 40% of
the time, while other nodes before and after it do not report the
error.  It seems that a node with a problem will always have the
problem, and a node where it works will always work.  This reinforces
my belief that the function itself is fine - it works and has worked
for months.

My thought is that it's some sort of caching issue, and I even thought
it might be a race condition with the backend storage being NFS - one
puppetmaster loading a cached yaml file before the other was done
writing it or something.  I've done all of the following, all with no
success:

* turned off one puppetmaster so traffic isn't split across them
* moved the yaml files for node/facts to local storage instead of NFS
* enabled IP-based persistence in haproxy so that traffic from a client
always goes to the same puppetmaster
* added --ignorecache in config.ru for puppetmaster

What I've discovered, however, is more interesting.  It appears that
if I go into the actual nagios_name.rb file and change it in any way
(add a single character of whitespace) and restart Apache, the error
goes away.  The file is detected as different and loaded for delivery
to the clients, and everything works fine after that.  I discovered
this by adding debug() statements to the function 2 weeks ago, only to
find that it worked fine from then on.  The problem resurfaced today
when I turned the 2nd puppetmaster back on, and I decided to try it
with whitespace - same thing.  Clears it right up.  This tells me that
there is some sort of caching wonkiness happening somewhere, but I'm
not able to figure out where.

Perhaps one of the variables the function is looking for (fqdn?) isn't
available at the time it's requested, resulting in a compile error
that isn't always visible?

I'm pleased to have a workaround, but to go from "Unknown function" to
"everything is cool" by adding a space to the file and saving it isn't
really much of a long-term solution.

I'm sending this to the list rather than filing a bug report to see if
anyone has experienced anything like this or has any thoughts.  If
there's any further information I can give to help narrow down the
source of the problem, I'm happy to do so.

Adrian
