A NOTE has been added to this issue.
======================================================================
http://www.dbmail.org/mantis/view.php?id=361
======================================================================
Reported By: blake
Assigned To: paul
======================================================================
Project: DBMail
Issue ID: 361
Category: IMAP daemon
Reproducibility: always
Severity: minor
Priority: normal
Status: assigned
target:
======================================================================
Date Submitted: 09-Jun-06 01:36 CEST
Last Modified: 18-Jul-06 13:07 CEST
======================================================================
Summary: IMAP zombies after about a day.
Description:
After starting up dbmail-imapd, I start to see zombies after about a day's
use. I am the only regular user of the server, with the current
Thunderbird client. Usually when it happens, I'll notice some weird laggy
behavior on the client, check the server, and sure enough there will be
one or more zombies. Here is a level 5 trace from the last command a child
handled before becomign a zombie:
Jun 8 19:04:32 colo dbmail/imap4d[7313]: COMMAND: [6 UID fetch 1332440
(UID RFC822.SIZE BODY.PEEK[])]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[0]: 'fetch'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[1]: '1332440'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[2]: '('
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[3]: 'UID'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[4]: 'RFC822.SIZE'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[5]: 'BODY.PEEK'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[6]: '['
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[7]: ']'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[8]: ')'
Jun 8 19:04:32 colo dbmail/imap4d[7313]: imap4.c,IMAPClientHandler:
Executing command uid...
Jun 8 19:04:32 colo dbmail/imap4d[7313]: db.c,db_acl_has_right: checking
ACL [read_flag] for user [1] on mailbox [5]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: db.c, db_acl_has_right: mailbox
[5] is owned by user [1], is that also [1]?
Jun 8 19:04:32 colo dbmail/imap4d[7313]: db.c, db_acl_has_right: mailbox
[5] is owned by user [1], giving all rights
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[2] =
[UID]
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: args[idx = 2] =
UID (returning 3)
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[3] =
[RFC822.SIZE]
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: args[idx = 3] =
RFC822.SIZE (returning 4)
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[4] =
[BODY.PEEK]
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,_imap_session_fetch_parse_octet_range: parse token
[)]
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[7] =
[)]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: imapcommands.c,_ic_fetch:
fetch_start [1332440] fetch_end [1332440]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmysql.c,db_query: query
[SELECT seen_flag, answered_flag, deleted_flag, flagged_flag, draft_flag,
recent_flag, DATE_FORMAT(internal_date, '%Y-%m-%d %T'), rfcsize,
message_idnr FROM dbmail_messages msg, dbmail_physmessage pm WHERE pm.id =
msg.physmessage_id AND message_idnr BETWEEN '1332440' AND '1332440' AND
mailbox_idnr = '5' AND status IN ('0','1') ORDER BY message_idnr ASC]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [* 1981 FETCH (]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [RFC822.SIZE 5541]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [ ]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [UID 1332440]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [ ]
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,_imap_cache_update: cache message [1332440] filter
[1]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmysql.c,db_query: query
[SELECT physmessage_id FROM dbmail_messages WHERE message_idnr =
'1332440']
Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmysql.c,db_query: query
[SELECT messageblk FROM dbmail_messageblks WHERE physmessage_id = '810148'
ORDER BY messageblk_idnr]
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-message.c,_set_content_from_stream: parse message
Jun 8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,_imap_cache_update: cache size [1088]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [BODY[] {1088} ]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [) ]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [6 OK UID FETCH
completed ]
Jun 8 19:04:32 colo dbmail/imap4d[7313]: imap4.c,IMAPClientHandler:
Finished command uid [0]
Jun 8 19:09:33 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_readln: error reading from
client
Jun 8 19:09:33 colo dbmail/imap4d[7313]: imap4.c,IMAPClientHandler: error
reading command -- bailing out
Jun 8 19:09:33 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_printf: write to client socket
failed
FWIW: I also noticed this behavior with 2.0.10, but only used that for a
short time before upgrading to trunk.
======================================================================
Relationships ID Summary
----------------------------------------------------------------------
related to 0000370 waitpid() in ParentSigHander() function...
related to 0000363 Somtimes the count of grandchild proces...
child of 0000303 Check for signal-safe calls in the pool...
======================================================================
----------------------------------------------------------------------
aaron - 09-Jun-06 17:57
----------------------------------------------------------------------
Please tell a little more about what's going on. Do all of the processes
zombie? Only the parent? Only the middle? Only the children?
Like this output from ps (trimmed a little)
parent:
root 9809 1 /usr/local/sbin/dbmail-lmtpd
middle:
nobody 9811 9809 /usr/local/sbin/dbmail-lmtpd
children:
nobody 9816 9811 /usr/local/sbin/dbmail-lmtpd
nobody 9818 9811 /usr/local/sbin/dbmail-lmtpd
----------------------------------------------------------------------
aaron - 09-Jun-06 18:09
----------------------------------------------------------------------
Ok, I think I found it. dbmail_imap_session_printf has a TRACE_FATAL when
"write to client socket failed". I changed them to TRACE_ERROR with return
code of -2 (unrecoverable errors) -- although callers clean up and die on
return code -1, too, I figure it might be useful down the road to know
what happened.
Please test current SVN.
Now that should send a SIGCHLD to the parent process letting it know that
the child died. For some reason that SIGCHLD is probably not getting
through, and the child process remains a zombie until its status is
reported to the parent. Search for 'zombie' on this page for some good
reading:
http://www.win.tue.nl/~aeb/linux/lk/lk-5.html
----------------------------------------------------------------------
aaron - 11-Jun-06 02:34
----------------------------------------------------------------------
I think I got this.
----------------------------------------------------------------------
aaron - 13-Jun-06 08:55
----------------------------------------------------------------------
Wait just a minute. child_unregister is called with a child process exits,
and it sets the STATE_WAIT flag so that the parent can reap the return
process. Except I can't find the code that looks for STATE_WAIT children!
In fact, reap_child should probably be recoded to use the same method as
child_unregister -- a variant that takes a pid as an argument and gives it
STATE_WAIT status. reap_child would kill the process, then mark the
scoreboard entry as STATE_WAIT. The next call to manage_spare_children (or
a new function, like manage_zombie_children ;-) would reap all STATE_WAIT
statuses.
----------------------------------------------------------------------
paul - 03-Jul-06 12:55
----------------------------------------------------------------------
I'm reopening this bug because it's still unresolved. I'm working on
cleaning up the pooling code.
----------------------------------------------------------------------
paul - 03-Jul-06 12:59
----------------------------------------------------------------------
I've committed a cleanup of the preforking code that seems to address this
issue. I've been doing some heavy load-tests on this code, and it now
appears to hold up to it's task.
Please test svn-trunk
----------------------------------------------------------------------
ryo - 04-Jul-06 11:14
----------------------------------------------------------------------
I have tested svn 2192. But I had any zombies of dbmail-pop3d.
I could know by using strace command that the middle process of
dbmail-pop3d stopped at the futex() as follows.
[EMAIL PROTECTED] ~]# strace -p 13841
Process 13841 attached - interrupt to quit
futex(0xc252ac, FUTEX_WAIT, 2, NULL
And gdb trace is as follows.
[EMAIL PROTECTED] ~]# gdb
GNU gdb Red Hat Linux (6.1post-1.20040607.43rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you
are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for
details.
This GDB was configured as "i386-redhat-linux-gnu".
(gdb) attach 13841
Attaching to process 13841
: (snip)
0x00bd5bc9 in __lll_mutex_lock_wait () from /lib/tls/i686/libc.so.6
(gdb) where
http://www.dbmail.org/mantis/view.php?id=0 0x00bd5bc9 in
__lll_mutex_lock_wait () from /lib/tls/i686/libc.so.6
http://www.dbmail.org/mantis/view.php?id=1 0x00bc722b in _L_mutex_lock_541 ()
from /lib/tls/i686/libc.so.6
http://www.dbmail.org/mantis/view.php?id=2 0x00000000 in ?? ()
I think maybe that the cause of this problem is that the signal-unsafe
(printf(), malloc()...)function is called in the
signal handler.
For example, the trace() function in dbmail calls g_strdup_vprintf() and
fprintf().
The signal-safe functions are defined in SUSv3.
http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html#tag_02_04_03
I am going to test the dbmail without signal-unsafe function.
----------------------------------------------------------------------
ryo - 18-Jul-06 07:02
----------------------------------------------------------------------
I continued testing the svn 2192 including signal-unsafe coding.
Then the futex deadlock (like 0001292) occurred in not only middle
process but also child process.
I think that the cause of this deadlock is to call client_close()
and disconnect_all()... in serverchild.c:active_child_sig_handle.
(Becase the free() and fclose(), are not signal-safe functions.)
"Secure Programming for Linux and Unix HOWTO" describes as follows.
- Where possible, have your signal handlers unconditionally set
a specific flag and do nothing else.
Secure Programming for Linux and Unix HOWTO:
http://www.dwheeler.com/secure-programs/Secure-Programs-HOWTO/signals.html
I created the patch in consideration of this document. Please see
the attached file(dbmail-signalsafe.patch).
I patched this patch to svn 2192 and I am continuing a stress test
for four days. So, I have no zombie and no deadlock yet.
The digest of this patch is as follows.
- only set flags in signal handlers.
- The signal processings are performed at the outside of signal
handlers.
- The data type of flags is volatile sig_atomic_t.
- nodaemonize mode have no signal processing yet. (any idea?)
Please check my patch(dbmail-signalsafe.patch).
----------------------------------------------------------------------
paul - 18-Jul-06 13:07
----------------------------------------------------------------------
patch accepted. Nice work Ryo!
Issue History
Date Modified Username Field Change
======================================================================
09-Jun-06 01:36 blake New Issue
09-Jun-06 17:57 aaron Note Added: 0001236
09-Jun-06 18:09 aaron Note Added: 0001237
11-Jun-06 02:34 aaron Status new => resolved
11-Jun-06 02:34 aaron Fixed in Version => SVN Trunk
11-Jun-06 02:34 aaron Resolution open => fixed
11-Jun-06 02:34 aaron Assigned To => aaron
11-Jun-06 02:34 aaron Note Added: 0001240
12-Jun-06 18:11 aaron Relationship added related to 0000363
13-Jun-06 08:55 aaron Note Added: 0001248
03-Jul-06 12:55 paul Note Added: 0001290
03-Jul-06 12:55 paul Resolution fixed => reopened
03-Jul-06 12:56 paul Assigned To aaron => paul
03-Jul-06 12:56 paul Status resolved => assigned
03-Jul-06 12:56 paul Relationship added related to 0000370
03-Jul-06 12:59 paul Note Added: 0001291
04-Jul-06 11:14 ryo Note Added: 0001292
04-Jul-06 15:46 paul Relationship added child of 0000303
18-Jul-06 07:02 ryo Note Added: 0001304
18-Jul-06 07:02 ryo File Added: dbmail-signalsafe.patch
18-Jul-06 13:07 paul Note Added: 0001305
======================================================================