Issue #1812 has been updated by luke.

Category set to plumbing
Status changed from Unreviewed to Accepted
Assigned to set to luke
Priority changed from Normal to Urgent
Target version set to 0.24.7

I think it's pretty obvious where the failures are happening, so we just need 
to protect those.  I don't think examples are necessary, but thanks.

I'm assigning this to 0.24.7, but it's up to James as to whether he'd let it be 
merged in, once I fix it.
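
One common way to protect a file write like this is to write to a temporary file and rename it into place, so a reader never sees a half-written file. The following is only a minimal sketch under that assumption, not the actual fix and not Puppet's code; the helper name `atomic_yaml_write` and the temp-file naming are hypothetical.

```ruby
require "yaml"

# Hypothetical helper (not part of Puppet): dump +data+ as YAML and move it
# into place with a rename, so readers see either the old or the new file.
def atomic_yaml_write(path, data)
  # Keep the temp file next to the target so the rename stays on one filesystem.
  tmp = "#{path}.#{Process.pid}.tmp"
  File.open(tmp, "w") do |f|
    f.write(YAML.dump(data))
    f.flush
    f.fsync # push the bytes to disk before the rename
  end
  File.rename(tmp, path) # atomic replacement on POSIX filesystems
  path
rescue StandardError
  # Clean up the partial temp file and re-raise the original error.
  File.delete(tmp) if tmp && File.exist?(tmp)
  raise
end
```

The rename only guards against readers seeing a partial file; if several writers can update the same node file at once, some form of locking would still be needed.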
----------------------------------------
Bug #1812: YAML files corrupted on server (due to high load?)
http://projects.reductivelabs.com/issues/show/1812

Author: nigelk2
Status: Accepted
Priority: Urgent
Assigned to: luke
Category: plumbing
Target version: 0.24.7
Complexity: Unknown
Affected version: 0.24.6
Keywords: 


From peter's mail to puppet-dev:


> Hi
> 
> it looks like the node YAML for a given node can get corrupted. This
> has happened to me a few times now, and each time only a few (2-3)
> nodes were affected.
> 
> So what's the actual problem?
> 
> Suddenly I find log entries like:
> 
> Tue Dec 09 15:34:27 +0100 2008 Puppet (err): Could not read YAML data
> for node foobar: syntax error on line 11, col 14: `  xen_domains: "3"'
> 
> in the puppetmaster.log and the master can't compile the node -> the
> node therefore won't get newer manifests. It also looks like the node
> itself gets into a corrupted state and is unable to apply a cached
> manifest.
> 
> I can fix this problem by deleting that node's YAML file in
> $puppetmaster_dir/yaml/node/ .
> 
> The master often seems to have been under high load when this
> corruption occurs. I haven't yet found a way to reproduce it, but from
> discussion in IRC it looks like other people also hit this problem at
> random: it's not always the same node that is affected, and it
> happens very rarely.
> 
> So this certainly looks like a bug. However, I was unsure whether the
> data I have gathered so far is sufficient to file one. Since this is
> still more of a something-happens-magically situation, I'd rather
> investigate a bit more and maybe even come up with a solution, or at
> least an idea for one.
> 
> It looks like the YAML data got broken; because of the high load there
> may have been problems during transmission or writing. Deleting the
> corrupt YAML file fixes the problem and, as far as I saw, it has no
> impact on the next run of the node.
> After examining the logs on the master and the client, it looks like the
> problem first occurs on the master. When it first happened, the master
> may well have been under very high load.
> 
> One solution I thought of is to simply delete the YAML file on the
> master. The client could then exit with an error (like the present one)
> and on its next run everything would be fine.
> But this might not be the right fix, as I can't yet see when the
> YAML file is transferred, nor what impact it actually has on compiling
> the manifest etc. We could also simply delete it and restart the
> client-run procedure (if that is possible), so the problem is fixed
> within a single client run (maybe with a maximum of 3 retries).
> Another option might be to check whether the YAML data was stored
> correctly and, if not, rewrite it if the YAML in memory is still
> correct, or otherwise request it again from the client.
> Another idea I had is that it might be a problem in Ruby's YAML
> library or similar.
> 
> So do you think this is definitely a bug? Where would be the
> best place to look for the actual problem, and what might be the best
> solution for it?
> 
> Testing the solution would be very easy: simply corrupt the YAML file
> and see if Puppet behaves as expected.
> However, I'm still really unsure how to reproduce the actual cause.
> 
> Thanks for any additional ideas or information. Once I have a more
> concrete idea of the actual source of the problem and the best way
> to fix it, I'll be more confident about filing a bug.
> 
> cheers pete

Corroborated by myself and Oliver Hookins
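
The manual workaround from the mail (delete the node's YAML file when it no longer parses) could be automated roughly as below. This is a hedged sketch only: `load_yaml_or_discard` is a hypothetical helper, not a Puppet function, and it assumes a current Ruby where the YAML library raises `Psych::SyntaxError` on malformed input (the Ruby versions of the 0.24.x era used a different YAML backend).

```ruby
require "yaml"

# Hypothetical helper (not part of Puppet): return the parsed YAML, or
# delete the file and return nil when it cannot be parsed.
def load_yaml_or_discard(path)
  YAML.load_file(path)
rescue Psych::SyntaxError => e
  # Mirror the manual fix: remove the corrupt file so it can be rewritten.
  warn "Discarding corrupt YAML #{path}: #{e.message}"
  File.delete(path)
  nil
end
```

A caller would treat a `nil` result as a missing cache entry. This only cleans up after the corruption; it does not address whatever is producing the broken files.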


