Hi,

On Mon, 2010-08-02 at 14:07 +0300, Jonathon Anderson wrote:
> I'm re-posting this because I'm not sure that it got through the first
> time.  If someone could at least echo back that this is reaching the
> list, I'd appreciate it.  (I'm new to the list.)

I don't know if your first message went through, but I confirm this one
did.

> Sometimes (with variable frequency) storeconfigs stores the wrong data
> in the fact_values table.  This has the end result that exported
> resources, when collected, have invalid configuration.
> 
> The most recent example: the "hostname" fact for one of our nodes got,
> instead, the value that should have gone in the "processorcount"
> fact.  This had the end result that the node's nagios configuration
> started trying to monitor a host "8" rather than "cn19", and ssh keys
> for cn19 were collected at other nodes as "8,8.example.com <keytext>"
> instead of "cn19,cn19.example.com <keytext>".  The hostname fact is
> the only destination that I've noticed the corrupted data in, but the
> source has been swapfree/swapsize, processor[n], operatingsystem,
> operatingsystemrelease, kernelrelease, and others.
> 
> I realize that I don't have much of a "simple, repeatable, minimal"
> test case here, but I've been trying to figure it out for months to no
> avail.  I had hoped that an upgrade to 2.6 would make this problem go
> away, but no:  we've just now experienced it again.  For the record,
> we've seen it since sometime in the 0.24.x branch (when we started
> using it).

So that's an "old" issue, not something introduced in the brand new 2.6.

> It might have something to do with an appropriately high load on
> storeconfigs.  I ran it for 2 days with nodes exporting data (but not
> collecting) to see if it would happen again, and I didn't notice any
> corruption.  Then, today, I enabled collection (e.g., ssh_known_hosts)
> on all (~138) hosts, and soon after found a corrupt nagios
> configuration.  (Then again, it might just be that it's more probable
> with more nodes doing the collection.)

Which seems logical.

> I've never seen the actual facter command return one of these bits of
> misplaced data: the furthest back I've been able to trace it is to the
> fact_values table.
> 
> We're using a single puppet master, with storeconfigs storing to a
> postgresql database on a different host from the puppet master host.
> Everything works in the majority of cases, but fails just often enough
> to make it really, really annoying.
> 
> Any help anyone can provide, including insight into where I might look
> to track down the cause even further, would be much appreciated.
> Thanks.

So the real question is where the issue comes from. As I see it, the
facts the node sends to the puppetmaster are correct; otherwise the
received catalog wouldn't apply correctly.
So the issue is, to my understanding, purely a storeconfigs issue.

The first thing you should check is the version of ActiveRecord and of
the postgres lib you are using. Try upgrading those; maybe the issue has
already been fixed there (assuming the issue is not on the Puppet side).

Next, you should try to work out where the issue comes from by having a
look at the SQL queries ActiveRecord generates:
1) clean up the mess so that you start with a good database
2) activate the ActiveRecord log on your master (set
rails_loglevel=debug and railslog=/path/to/rails.log)
3) let it run until you notice the issue
4) read the rails log to find the culprit SQL query; maybe that will
give you more information. At least we'll know what it tries to save.
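
Reading a debug-level log by hand on a busy master is tedious, so step 4
can be narrowed down with a small filter. The following is only a sketch
in Python: it assumes each SQL statement appears on a single log line and
that writes to the table show up as INSERT or UPDATE statements naming
fact_values. The function name is hypothetical and the exact log format
of your ActiveRecord version may differ.

```python
import re

# Assumption: one SQL statement per log line, with the table name
# optionally wrapped in double quotes.
WRITE_RE = re.compile(
    r'\b(INSERT\s+INTO|UPDATE)\s+"?fact_values"?', re.IGNORECASE
)


def fact_value_writes(lines):
    """Yield (line_number, text) for lines that write to fact_values."""
    for number, line in enumerate(lines, start=1):
        if WRITE_RE.search(line):
            yield number, line.rstrip("\n")
```

Passing it an open file object for the rails log yields only the
statements that modify fact_values, which should be much easier to scan
for a value such as '8' landing where a hostname was expected.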

Then, I'd add debug statements to the puppetmaster (check
lib/puppet/rails/host.rb, especially the merge_facts method). By
correlating this debug information with the query log, you might be able
to notice a pattern, or at least find out whether the problem comes from
the data Puppet has or is created in the AR layer.
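
To make that correlation concrete, here is a rough Python sketch of the
comparison I have in mind. It assumes you have already collected, for one
node, the facts as Puppet logged them and the facts as read back from the
fact_values table, each as a plain name-to-value dict. How you build
those dicts is left out, and both function names are hypothetical.

```python
def mismatched_facts(sent, stored):
    """Return {name: (sent_value, stored_value)} for facts that differ."""
    diffs = {}
    for name, value in sent.items():
        if name in stored and stored[name] != value:
            diffs[name] = (value, stored[name])
    return diffs


def likely_sources(sent, wrong_value):
    """Names of sent facts whose value equals a wrongly stored value."""
    return sorted(n for n, v in sent.items() if v == wrong_value)
```

With the reported example, mismatched_facts would flag hostname as
("cn19", "8"), and likely_sources(sent, "8") would point at
processorcount as a candidate origin of the misplaced value.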

You should also file a bug report with all the information you find.
Hope that helps,
-- 
Brice Figureau
Follow the latest Puppet Community evolutions on www.planetpuppet.org!
