Hi!
I have noticed that cachelogd starts consuming 100% of CPU time *exactly* 5
minutes after the daemon is started. Gdb showed that the load comes from the select()
loop.
In the source code, I found:
----- cachelogd.c
<...code...>
tval.tv_sec = 300;
tval.tv_usec = 0;
<...code...>
while (1) {
fd_set msk;
msk=mask;
sel=select(32,&msk,0,0,&tval);
if(sel==0){
/* Time limit expired */
continue;
}
if(sel<1){
/* FIXME: add errno checking */
continue;
}
/* Check whether new client has connected */
if(FD_ISSET(ctl_sock,&msk)){
int cl,addrlen;
addrlen = sizeof(his_addr);
for(cl=0;cl<MAXCLIENT;cl++){
if(log_client[cl].fd==0){
fprintf(stderr,"%s Client #%d connected\n",time_pid_info(),cl);
log_client[cl].fd=accept(ctl_sock, (struct sockaddr *)&his_addr, &addrlen);
FD_SET(log_client[cl].fd,&mask);
log_client[cl].state=STATE_CMD;
log_client[cl].rbytes=sizeof(UDM_LOGD_CMD);
break;
}
}
}
<code...>
-----
Now... I might be talking complete rubbish, but I hope someone will correct me :)
From what I could find, there are 2 'ways' to use select()/accept(). One way is to
accept() first, then use select() later. There select() has a timeout, and if nothing
happens on the socket during that period, select() returns 0 and some action can be
performed (close the socket, or whatever - depending on needs).
The other way is to call select() first and accept() later (as in cachelogd.c). But
in that case select() is called with a NULL timeout, which makes it 'block' until
some input comes in.
What happens right now in cachelogd (as far as I can see, but I'm not a programmer by
'definition', so... ;) is that cachelogd is OK for 5 minutes (while select() is
actually sleeping), but once the timer reaches 0, select() returns immediately on
every call and the loop spins. It can be checked in gdb as well. Something like:
-----
[root@emx sbin]# gdb ./cachelogd
<loading...>
(gdb) b 416
Breakpoint 1 at 0x80494a7: file cachelogd.c, line 416.
(gdb) r
Starting program: /opt/mnogosearch/sbin/./cachelogd
Wed 21 16:49:59 [21785] Open logs 0 0
Wed 21 16:49:59 [21785] cachelogd started. Accepting 128 connections.
Breakpoint 1, main (argc=1, argv=0xbffffce4) at cachelogd.c:416
warning: Source file is more recent than executable.
416 sel=select(32,&msk,0,0,&tval);
[This select() is the 1st one that gets executed, and tval.tv_sec is 300 at this
point.]
(gdb) c
Continuing.
[exactly 300 seconds later...]
Breakpoint 1, main (argc=1, argv=0xbffffce4) at cachelogd.c:416
416 sel=select(32,&msk,0,0,&tval);
(gdb) p tval.tv_sec
$1 = 0
(gdb) c
Continuing.
Breakpoint 1, main (argc=1, argv=0xbffffce4) at cachelogd.c:416
416 sel=select(32,&msk,0,0,&tval);
(gdb) c
Continuing.
Breakpoint 1, main (argc=1, argv=0xbffffce4) at cachelogd.c:416
416 sel=select(32,&msk,0,0,&tval);
(gdb) c
Continuing.
etc... (repeats forever)
-----
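A likely explanation for the gdb output above: POSIX allows select() to modify the
timeout it is given, and on Linux it is updated to the time that was not slept. Once
the first 300-second wait expires, tval holds 0/0, so every later call is a
non-blocking poll that returns 0 at once, and the `continue` sends the loop straight
back to select(). The sketch below reproduces only that last part; the helper name
count_immediate_timeouts is made up for illustration and is not from cachelogd.c.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical helper (not part of cachelogd): call select() `iterations`
 * times with a timeout that is already zero and count how many of those
 * calls report "timed out" (return value 0). */
int count_immediate_timeouts(int iterations)
{
    struct timeval tval;
    int i, timeouts = 0;

    assert(iterations >= 0);

    /* The state gdb showed for cachelogd's tval after the first 300 s. */
    tval.tv_sec = 0;
    tval.tv_usec = 0;

    for (i = 0; i < iterations; i++) {
        /* No descriptors are watched, so only the timeout matters.
         * A zero timeout means "poll and return immediately". */
        if (select(0, NULL, NULL, NULL, &tval) == 0)
            timeouts++;
    }
    return timeouts;
}
```

Every call comes back immediately with 0, which is the same pattern as breakpoint 1
being hit over and over in the session above.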
I can think of 3 possible ways to fix this. But I would *really* appreciate it if
someone with more 'socket experience' could give the proper fix and possibly explain
the real issue here :)
1. Do something like:
if(sel==0){
tval.tv_sec = 300; /* reset the timer when it reaches 0 */
/* Time limit expired */
continue;
}
In this case, the timer is reset every time it reaches 0. Seems to work OK, no
'side-effects' noticed (tried for 30 minutes and re-indexed a few thousand pages)
2. Instead of:
sel=select(32,&msk,0,0,&tval);
use:
/* I prefer NULL over 0 - just for 'aesthetic' purposes, sorry :) */
sel=select(32,&msk,NULL,NULL,(struct timeval *)NULL);
This *should* make select() "block" until there is actually something it can deal with
(new connection, etc). Seems to work OK, no 'side-effects' noticed (still running,
re-indexing 10,000 pages)
3. Rewrite this part using accept() first, and then select()
I don't think it's really needed :)
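Options 1 and 2 above can be sketched as two small helpers. This is only an
illustration under a few assumptions, not the cachelogd fix: the names
select_fresh_timeout and read_when_ready are made up, the first helper copies the
interval before *every* call (rather than resetting it only when select() returns 0),
and the second fills in the errno check from the FIXME by retrying on EINTR. The
second helper also passes fd + 1 instead of a fixed 32 as the first argument, since
select() only examines descriptors below that number.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Option 1 variant (hypothetical helper): wait on `readfds` for at most
 * `interval`, giving select() a private copy of the timeout so that the
 * caller's value is never consumed. */
int select_fresh_timeout(int nfds, fd_set *readfds,
                         const struct timeval *interval)
{
    struct timeval tval;

    assert(interval != NULL);
    tval = *interval;               /* select() may overwrite this copy */
    return select(nfds, readfds, NULL, NULL, &tval);
}

/* Option 2 variant (hypothetical helper): block with a NULL timeout until
 * `fd` is readable, then read from it. Returns the result of read(), or
 * -1 if select() fails for a reason other than a signal. */
ssize_t read_when_ready(int fd, void *buf, size_t len)
{
    fd_set rd;
    int sel;

    assert(fd >= 0 && fd < FD_SETSIZE);
    for (;;) {
        /* Rebuild the set each time: select() overwrites it. */
        FD_ZERO(&rd);
        FD_SET(fd, &rd);

        /* NULL timeout: sleep until the descriptor is ready. */
        sel = select(fd + 1, &rd, NULL, NULL, NULL);
        if (sel > 0)
            return read(fd, buf, len);
        if (sel < 0 && errno != EINTR)
            return -1;              /* real error, errno is set */
        /* Interrupted by a signal: go around and wait again. */
    }
}
```

With the first helper the daemon still wakes up every interval, which matters only
if something has to happen on a timeout; if nothing does, the blocking form in the
second helper is the simpler choice.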
I hope there is someone more experienced to check this out :)
Thanks.
--
Vanja Hrustic
The Relay Group
http://relaygroup.com
Technology Ahead of Time