I cannot see how this can occur. Any ideas from anyone else? There are no loops. It is possible that this is internal to db.
Mark

On Sat, 2005-05-07 at 18:41 -0400, Yaroslav Halchenko wrote:
> I've found the reason, and it would probably be beneficial to adjust
> cfservd so that it doesn't get into such a situation again:
>
> I had a leftover file
> /tmp/__db.testDATABASEcache
>
> so strace revealed an infinite loop of
>
> 28731 stat64("/tmp/testDATABASEcache", 0xb7c57350) = -1 ENOENT (No such file or directory)
> 28731 open("/tmp/__db.testDATABASEcache", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0644) = -1 EEXIST (File exists)
> 28731 open("/tmp/__db.testDATABASEcache", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0644) = -1 EEXIST (File exists)
> 28731 open("/tmp/__db.testDATABASEcache", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0644) = -1 EEXIST (File exists)
>
> Version installed (debian unstable)
> cfengine2 2.1.14-1
>
> --
> Yarik
>
> On Sat, May 07, 2005 at 11:50:59AM -0400, Yaroslav Halchenko wrote:
> > Dear All,
> >
> > Yesterday one of the users filled up /tmp on a main node with junk and it
> > rendered cfengine unusable.
> > First it reported
> >
> > daemon.log:May 6 21:11:23 ravana cfservd[16657]: Couldn't open checksum database /tmp/testDATABASEcache
> > daemon.log:May 6 21:11:23 ravana cfservd[16657]: db_open: No space left on device
> >
> > and it seems that after that, whenever any node connects to it, cfservd
> > becomes extremely busy and then finally fails, with the following message
> > reported by the nodes
> >
> > cfengine:node20: Received signal 13 (SIGPIPE) while doing [no_active_lock]
> > cfengine:node20: Logical start time Fri May 6 23:51:10 2005
> > cfengine:node20: This sub-task started really at Fri May 6 23:51:10 2005
> >
> > or actually now, for some reason, without a node name
> >
> > cfengine:: Received signal 13 (SIGPIPE) while doing [pre-lock-state]
> > cfengine:: Logical start time Sat May 7 11:00:33 2005
> > cfengine:: This sub-task started really at Sat May 7 11:00:33 2005
> >
> > and then another one reporting a refusal to copy
> >
> > cfengine:: Transmission refused or failed statting /etc/cfengine/inputs/CVS/Repository
> > Got:
> > cfengine:: Received signal 13 (SIGPIPE) while doing [lock.cfagent_conf.node2.copy.copy_3343]
> > cfengine:: Logical start time Sat May 7 04:30:29 2005
> > cfengine:: This sub-task started really at Sat May 7 04:30:29 2005
> >
> > I've tried restarting the cfengine parts on both ends; it doesn't help.
> > Running cfservd with -d2 gave the following while trying to run the update
> > script (copy /etc/cfengine/input files across the nodes into /etc/cfengine)
> >
> > ----------------------------------------
> > ...
> > Access privileges - match found
> > cfservd: Host node2.ravana.rutgers.edu granted access to /etc/cfengine/inputs/CVS/Root
> > Clocks were off by 0
> > StatFile(/etc/cfengine/inputs/CVS/Root)
> > OK: type=0
> > mode=644
> > lmode=0
> > uid=0
> > gid=0
> > size=10
> > atime=1115477605
> > mtime=1067285389
> > Transaction Send[t 65][Packed text]
> > Attempting to send 73 bytes
> > SendSocketStream, sent 73
> > Transaction Send[t 3][Packed text]
> > Attempting to send 11 bytes
> > SendSocketStream, sent 11
> > RecvSocketStream(8)
> > (Concatenated 8 from stream)
> > Transaction Receive [t 51][]
> > RecvSocketStream(51)
> > (Concatenated 51 from stream)
> > Received: [MD5 /etc/cfengine/inputs/CVS/Root] on socket 5
> > CompareLocalChecksums(/etc/cfengine/inputs/CVS/Root,MD5=05e8d918529f204488a626792c4f8a6f)
> > ChecksumChanged: key /etc/cfengine/inputs/CVS/Root with data MD5=05e8d918529f204488a626792c4f8a6f
> >
> > <At this point it stalls for a minute or two although cfservd running busy>
> >
> > IPV4 address
> > sockaddr_ntop(10.0.0.2)
> > Obtained IP address of 10.0.0.2 on socket 7 from accept
> >
> > FuzzyItemIn(LIST,10.0.0.2)
> > Purging Old Connections...
> > Done purging
> >
> > FuzzyItemIn(LIST,10.0.0.2)
> > cfservd: Denying repeated connection from 10.0.0.2
> > ----------------------------------------
> >
> > from client (cfagent) side it looks like
> >
> > ----------------------------------------
> > Compare binary sums on ravana:/etc/cfengine/inputs/CVS/Root & /var/lib/cfengine2/inputs/CVS/Root
> > Using network md5 checksum instead
> > ChecksumFile(m,/var/lib/cfengine2/inputs/CVS/Root)
> > Send digest of /var/lib/cfengine2/inputs/CVS/Root to server, MD5=05e8d918529f204488a626792c4f8a6f
> > Transaction Send[t 51][Packed text]
> > Attempting to send 59 bytes
> > SendSocketStream, sent 59
> > RecvSocketStream(8)
> > <STALLS HERE and I got bored waiting till it dies...
> > maybe it never dies this time>
> >
> > ----------------------------------------
> >
> > So here are the questions:
> >
> > 1. How can I fix the current situation?
> > Clearly something is broken in the current state, so maybe I can
> > clean out the cfengine state and start from a clean one. I wouldn't
> > mind if the first run takes longer ;-) Sure, I could completely
> > reinstall, and then I believe it should work, but...
> >
> > 2. What would be a good policy to enforce on /tmp so that I don't
> > remove anything valuable (like ssh-agent sockets and other stuff
> > opened by running programs)? I'm thinking of something like: large
> > files and directories (>1M) should be forbidden if they are older
> > than an hour. I'm not sure I can discard data solely on age, so
> > age+size sounds good to me.
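
The age+size policy from question 2 could be sketched roughly as follows. This is only an illustration in Python, not a cfengine configuration: the function name `find_stale_large_files` and its parameters are hypothetical, "1M" is assumed to mean 1 MiB, and "older than an hour" is assumed to refer to the modification time. The sketch only reports candidates and deletes nothing. It looks at regular files only, so sockets such as those of ssh-agent are skipped, but it cannot tell whether a running program still has a file open, and it does not measure directory sizes.

```python
import os
import stat
import time

MAX_BYTES = 1024 * 1024    # assumption: ">1M" means more than 1 MiB
MAX_AGE_SECONDS = 60 * 60  # "older than an hour", judged by mtime


def find_stale_large_files(root, max_bytes=MAX_BYTES,
                           max_age=MAX_AGE_SECONDS, now=None):
    """Return sorted paths of regular files under root that are both
    larger than max_bytes and older than max_age seconds."""
    now = time.time() if now is None else now
    candidates = []
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # the file vanished or is not accessible
            if not stat.S_ISREG(info.st_mode):
                continue  # skip sockets, FIFOs, symlinks, device nodes
            too_big = info.st_size > max_bytes
            too_old = (now - info.st_mtime) > max_age
            if too_big and too_old:
                candidates.append(path)
    return sorted(candidates)
```

Calling `find_stale_large_files("/tmp")` returns a list of paths that could be reviewed before anything is removed. Requiring both conditions means small files are kept regardless of age and large files are kept while they are still being written.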