A NOTE has been added to this issue.
======================================================================
http://www.dbmail.org/mantis/view.php?id=361
======================================================================
Reported By: blake
Assigned To: paul
======================================================================
Project: DBMail
Issue ID: 361
Category: IMAP daemon
Reproducibility: always
Severity: minor
Priority: normal
Status: assigned
target:
======================================================================
Date Submitted: 09-Jun-06 01:36 CEST
Last Modified: 03-Jul-06 12:59 CEST
======================================================================
Summary: IMAP zombies after about a day.
Description:
After starting up dbmail-imapd, I start to see zombies after about a day's
use. I am the only regular user of the server, with the current
Thunderbird client. Usually when it happens, I'll notice some weird laggy
behavior on the client, check the server, and sure enough there will be
one or more zombies. Here is a level 5 trace from the last command a child
handled before becomign a zombie:
Jun 8 19:04:32 colo dbmail/imap4d[7313]: COMMAND: [6 UID fetch 1332440
(UID RFC822.SIZE BODY.PEEK[])]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[0]: 'fetch'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[1]: '1332440'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[2]: '('
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[3]: 'UID'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[4]: 'RFC822.SIZE'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[5]: 'BODY.PEEK'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[6]: '['
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[7]: ']'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[8]: ')'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: imap4.c,IMAPClientHandler:
Executing command uid...
Jun 8 19:04:32 colo dbmail/imap4d[7313]: db.c,db_acl_has_right: checking
ACL [read_flag] for user [1] on mailbox [5]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: db.c, db_acl_has_right: mailbox
[5] is owned by user [1], is that also [1]?
Jun 8 19:04:32 colo dbmail/imap4d[7313]: db.c, db_acl_has_right: mailbox
[5] is owned by user [1], giving all rights
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[2] =
[UID]
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: args[idx = 2] =
UID (returning 3)
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[3] =
[RFC822.SIZE]
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: args[idx = 3] =
RFC822.SIZE (returning 4)
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[4] =
[BODY.PEEK]
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,_imap_session_fetch_parse_octet_range: parse token
[)]
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[7] =
[)]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: imapcommands.c,_ic_fetch:
fetch_start [1332440] fetch_end [1332440]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmysql.c,db_query: query
[SELECT seen_flag, answered_flag, deleted_flag, flagged_flag, draft_flag,
recent_flag, DATE_FORMAT(internal_date, '%Y-%m-%d %T'), rfcsize,
message_idnr FROM dbmail_messages msg, dbmail_physmessage pm WHERE pm.id =
msg.physmessage_id AND message_idnr BETWEEN '1332440' AND '1332440' AND
mailbox_idnr = '5' AND status IN ('0','1') ORDER BY message_idnr ASC]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [* 1981 FETCH (]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [RFC822.SIZE 5541]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [ ]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [UID 1332440]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [ ]
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,_imap_cache_update: cache message [1332440] filter
[1]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmysql.c,db_query: query
[SELECT physmessage_id FROM dbmail_messages WHERE message_idnr =
'1332440']
Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmysql.c,db_query: query
[SELECT messageblk FROM dbmail_messageblks WHERE physmessage_id = '810148'
ORDER BY messageblk_idnr]
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-message.c,_set_content_from_stream: parse message
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,_imap_cache_update: cache size [1088]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [BODY[] {1088} ]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [) ]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [6 OK UID FETCH
completed ]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: imap4.c,IMAPClientHandler:
Finished command uid [0]
Jun 8 19:09:33 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_readln: error reading from
client
Jun 8 19:09:33 colo dbmail/imap4d[7313]: imap4.c,IMAPClientHandler: error
reading command -- bailing out
Jun 8 19:09:33 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_printf: write to client socket
failed
FWIW: I also noticed this behavior with 2.0.10, but only used that for a
short time before upgrading to trunk.
======================================================================
Relationships ID Summary
----------------------------------------------------------------------
related to 0000370 waitpid() in ParentSigHander() function...
related to 0000363 Somtimes the count of grandchild proces...
======================================================================
----------------------------------------------------------------------
aaron - 09-Jun-06 17:57
----------------------------------------------------------------------
Please tell a little more about what's going on. Do all of the processes
zombie? Only the parent? Only the middle? Only the children?
Like this output from ps (trimmed a little)
parent:
root 9809 1 /usr/local/sbin/dbmail-lmtpd
middle:
nobody 9811 9809 /usr/local/sbin/dbmail-lmtpd
children:
nobody 9816 9811 /usr/local/sbin/dbmail-lmtpd
nobody 9818 9811 /usr/local/sbin/dbmail-lmtpd
----------------------------------------------------------------------
aaron - 09-Jun-06 18:09
----------------------------------------------------------------------
Ok, I think I found it. dbmail_imap_session_printf has a TRACE_FATAL when
"write to client socket failed". I changed them to TRACE_ERROR with return
code of -2 (unrecoverable errors) -- although callers clean up and die on
return code -1, too, I figure it might be useful down the road to know
what happened.
Please test current SVN.
Now that should send a SIGCHLD to the parent process letting it know that
the child died. For some reason that SIGCHLD is probably not getting
through, and the child process remains a zombie until its status is
reported to the parent. Search for 'zombie' on this page for some good
reading:
http://www.win.tue.nl/~aeb/linux/lk/lk-5.html
----------------------------------------------------------------------
aaron - 11-Jun-06 02:34
----------------------------------------------------------------------
I think I got this.
----------------------------------------------------------------------
aaron - 13-Jun-06 08:55
----------------------------------------------------------------------
Wait just a minute. child_unregister is called with a child process exits,
and it sets the STATE_WAIT flag so that the parent can reap the return
process. Except I can't find the code that looks for STATE_WAIT children!
In fact, reap_child should probably be recoded to use the same method as
child_unregister -- a variant that takes a pid as an argument and gives it
STATE_WAIT status. reap_child would kill the process, then mark the
scoreboard entry as STATE_WAIT. The next call to manage_spare_children (or
a new function, like manage_zombie_children ;-) would reap all STATE_WAIT
statuses.
----------------------------------------------------------------------
paul - 03-Jul-06 12:55
----------------------------------------------------------------------
I'm reopening this bug because it's still unresolved. I'm working on
cleaning up the pooling code.
----------------------------------------------------------------------
paul - 03-Jul-06 12:59
----------------------------------------------------------------------
I've committed a cleanup of the preforking code that seems to address this
issue. I've been doing some heavy load-tests on this code, and it now
appears to hold up to it's task.
Please test svn-trunk
Issue History
Date Modified Username Field Change
======================================================================
09-Jun-06 01:36 blake New Issue
09-Jun-06 17:57 aaron Note Added: 0001236
09-Jun-06 18:09 aaron Note Added: 0001237
11-Jun-06 02:34 aaron Status new => resolved
11-Jun-06 02:34 aaron Fixed in Version => SVN Trunk
11-Jun-06 02:34 aaron Resolution open => fixed
11-Jun-06 02:34 aaron Assigned To => aaron
11-Jun-06 02:34 aaron Note Added: 0001240
12-Jun-06 18:11 aaron Relationship added related to 0000363
13-Jun-06 08:55 aaron Note Added: 0001248
03-Jul-06 12:55 paul Note Added: 0001290
03-Jul-06 12:55 paul Resolution fixed => reopened
03-Jul-06 12:56 paul Assigned To aaron => paul
03-Jul-06 12:56 paul Status resolved => assigned
03-Jul-06 12:56 paul Relationship added related to 0000370
03-Jul-06 12:59 paul Note Added: 0001291
======================================================================