Re: cfservd thrushes, nodes fail to get anything

Yaroslav Halchenko Sat, 07 May 2005 15:46:49 -0700

I've found the cause, and it would probably be beneficial to adjust
cfservd so that it doesn't get into this situation again:


I had a leftover file 
/tmp/__db.testDATABASEcache

so strace revealed an infinite loop of

28731 stat64("/tmp/testDATABASEcache", 0xb7c57350) = -1 ENOENT (No such file or directory)
28731 open("/tmp/__db.testDATABASEcache", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0644) = -1 EEXIST (File exists)
28731 open("/tmp/__db.testDATABASEcache", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0644) = -1 EEXIST (File exists)
28731 open("/tmp/__db.testDATABASEcache", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0644) = -1 EEXIST (File exists)
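
As an illustration of the kind of adjustment meant here, below is a minimal Python sketch of a bounded retry around an exclusive create. It is not the actual cfservd or Berkeley DB code; the function name `create_exclusive` and the `max_attempts` parameter are made up for this example. It only shows that an `O_CREAT|O_EXCL` open on a stale file keeps failing with EEXIST, so the caller has to give up and report the problem rather than retry forever.

```python
import os
import tempfile


def create_exclusive(path, max_attempts=3):
    """Create `path` exclusively; give up after `max_attempts` EEXIST failures.

    Hypothetical helper for illustration only, not part of cfengine.
    """
    for _ in range(max_attempts):
        try:
            # Same flag combination as in the strace output (minus O_LARGEFILE).
            return os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o644)
        except FileExistsError:
            # A stale file never goes away on its own, so retrying without
            # a limit would spin exactly like the loop above.
            continue
    raise RuntimeError(
        f"{path} already exists after {max_attempts} attempts; "
        "it may be a leftover that has to be removed by hand"
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        stale = os.path.join(tmpdir, "__db.testDATABASEcache")
        open(stale, "w").close()  # simulate the leftover file
        try:
            create_exclusive(stale)
        except RuntimeError as err:
            print(err)
```

With the leftover file in place, the call fails after three attempts with a readable error instead of looping.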


Version installed (Debian unstable):
cfengine2              2.1.14-1 

--
Yarik


On Sat, May 07, 2005 at 11:50:59AM -0400, Yaroslav Halchenko wrote:
> Dear All,

> Yesterday one of the users filled up /tmp on a main node with junk, and it
> rendered cfengine unusable. First it reported

> daemon.log:May  6 21:11:23 ravana cfservd[16657]:  Couldn't open checksum database /tmp/testDATABASEcache
> daemon.log:May  6 21:11:23 ravana cfservd[16657]:  db_open: No space left on device

> and it seems that after that, whenever any node connects to it, cfservd
> becomes extremely busy and then finally fails, with the following message
> being reported by the nodes

> cfengine:node20: Received signal 13 (SIGPIPE) while doing [no_active_lock]
> cfengine:node20: Logical start time Fri May  6 23:51:10 2005
> cfengine:node20: This sub-task started really at Fri May  6 23:51:10 2005

> or actually now for some reason without a node name

> cfengine:: Received signal 13 (SIGPIPE) while doing [pre-lock-state]
> cfengine:: Logical start time Sat May  7 11:00:33 2005
> cfengine:: This sub-task started really at Sat May  7 11:00:33 2005

> and then another one reporting that the copy was refused

> cfengine:: Transmission refused or failed statting /etc/cfengine/inputs/CVS/Repository
> Got:
> cfengine:: Received signal 13 (SIGPIPE) while doing [lock.cfagent_conf.node2.copy.copy_3343]
> cfengine:: Logical start time Sat May  7 04:30:29 2005
> cfengine:: This sub-task started really at Sat May  7 04:30:29 2005

> I've tried restarting the cfengine parts on both ends; it doesn't help.
> Running cfservd with -d2 gave the following while trying to run the update
> script (copy /etc/cfengine/input files across the nodes into /etc/cfengine)

> ----------------------------------------
> ...
> Access privileges - match found
> cfservd: Host node2.ravana.rutgers.edu granted access to /etc/cfengine/inputs/CVS/Root
> Clocks were off by 0
> StatFile(/etc/cfengine/inputs/CVS/Root)
> OK: type=0
>  mode=644
>  lmode=0
>  uid=0
>  gid=0
>  size=10
>  atime=1115477605
>  mtime=1067285389
> Transaction Send[t 65][Packed text]
> Attempting to send 73 bytes
> SendSocketStream, sent 73
> Transaction Send[t 3][Packed text]
> Attempting to send 11 bytes
> SendSocketStream, sent 11
> RecvSocketStream(8)
>     (Concatenated 8 from stream)
> Transaction Receive [t 51][]
> RecvSocketStream(51)
>     (Concatenated 51 from stream)
> Received: [MD5 /etc/cfengine/inputs/CVS/Root] on socket 5
> CompareLocalChecksums(/etc/cfengine/inputs/CVS/Root,MD5=05e8d918529f204488a626792c4f8a6f)
> ChecksumChanged: key /etc/cfengine/inputs/CVS/Root with data MD5=05e8d918529f204488a626792c4f8a6f

> <At this point it stalls for a minute or two, although cfservd is running
> busy>

> IPV4 address
> sockaddr_ntop(10.0.0.2)
> Obtained IP address of 10.0.0.2 on socket 7 from accept

> FuzzyItemIn(LIST,10.0.0.2)
> Purging Old Connections...
> Done purging

> FuzzyItemIn(LIST,10.0.0.2)
> cfservd: Denying repeated connection from 10.0.0.2
> ----------------------------------------

> From the client (cfagent) side it looks like

> ----------------------------------------
> Compare binary sums on ravana:/etc/cfengine/inputs/CVS/Root & /var/lib/cfengine2/inputs/CVS/Root
> Using network md5 checksum instead
> ChecksumFile(m,/var/lib/cfengine2/inputs/CVS/Root)
> Send digest of /var/lib/cfengine2/inputs/CVS/Root to server, MD5=05e8d918529f204488a626792c4f8a6f
> Transaction Send[t 51][Packed text]
> Attempting to send 59 bytes
> SendSocketStream, sent 59
> RecvSocketStream(8)
> <STALLS HERE and I got bored waiting until it dies... maybe it never
> dies this time>

> ----------------------------------------

> So here are the questions:

> 1. How to fix the current situation?
>    Clearly something is broken in the current state, so maybe I can
>    clean out the cfengine state so as to start from a clean one - I wouldn't
>    mind if it takes longer to run the first time ;-) Sure, I can
>    completely reinstall and then it should work, I believe, but...


> 2. What would be a nice policy to enforce over /tmp so that I don't
> remove anything valuable (like ssh-agent sockets and other stuff
> opened by running programs)? I'm thinking about something like: files and
> directories large in size (>1M) should be forbidden if they are older
> than an hour. I'm not sure if I can discard data solely on age, so
> age+size sounds good to me.
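
The age+size idea from question 2 could be prototyped outside cfengine before turning it into a policy. The following is a rough Python sketch under several assumptions: the function name `find_candidates` and the constants are invented for this example, only regular files are considered (so sockets such as those of ssh-agent, FIFOs and symlinks are never touched), directories are not sized, and no check is made whether a file is still open by a running program. It only reports candidates and deletes nothing.

```python
import os
import stat
import time

MAX_SIZE = 1024 * 1024   # 1 MiB, the ">1M" threshold from the question
MAX_AGE = 60 * 60        # one hour, in seconds


def find_candidates(root, max_size=MAX_SIZE, max_age=MAX_AGE, now=None):
    """Return paths of regular files under `root` that are both large and old.

    Hypothetical helper for illustration only, not a cfengine feature.
    """
    now = time.time() if now is None else now
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # Only regular files; sockets, FIFOs, symlinks, devices are left alone.
            if not stat.S_ISREG(st.st_mode):
                continue
            # Both conditions must hold: larger than max_size AND older than max_age.
            if st.st_size > max_size and now - st.st_mtime > max_age:
                candidates.append(path)
    return candidates


if __name__ == "__main__":
    for path in find_candidates("/tmp"):
        print(path)  # report only; deletion is deliberately left out
```

Age is measured by modification time here; whether mtime or access time is the better criterion for /tmp is a judgment call that this sketch does not settle.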
-- 
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers-Newark
Office: (973) 353-5440x263 | FWD: 82823 | Fax: (973) 353-1171
        101 Warren Str, Smith Hall, Rm 4-105, Newark NJ 07105
Student  Ph.D. @ CS Dept. NJIT


_______________________________________________
Help-cfengine mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/help-cfengine
