Re: has anybody seen worker segfaults?
Jeff Trawick [EMAIL PROTECTED] writes:

> t0: we need to fork() a new child for some reason
>
> t1: we get the graceful restart prod on the pod BEFORE the
>     start_threads() thread has gotten dispatched and initialized
>     worker_queue
>
> t2: we call signal_workers(), which tries to use a NULL worker_queue,
>     and we segfault
>
> One possible fix for this is to initialize worker_queue in child_main()
> before creating the start_threads() thread, so that there is no question
> that the init has been done before we try to use it.
>
> But there seem to be other timing issues as well. Maybe we miss our
> chance to apr_thread_join() a worker thread right before it has been
> created. A fix that would handle that as well as the first problem would
> be to join the start_threads() thread before trying to process a restart
> request.

After trying this, the segfaults go away, but we still have some processes bailing out prematurely (we still get the accept mutex error, and still drop a lot of connections even though nobody is segfaulting).

--
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
http://www.geocities.com/SiliconValley/Park/9289/
Born in Roswell... married an alien...
Re: has anybody seen worker segfaults?
Aaron Bannert [EMAIL PROTECTED] writes:

> On Thu, Feb 21, 2002 at 06:02:26PM -0500, Jeff Trawick wrote:
> > I just tried to hit this on my Solaris x86 box with no luck. I did
> > 200,000 simple requests with a SIGUSR1 sent to the parent every 2
> > seconds. No segfaults or failed pthread calls, but at the end of the
> > 200,000 requests I had 177 stranded connections (client still waiting
> > on a response on 177 connections). I also had 177 connections to the
> > httpd port in FIN_WAIT_2. Weird...
>
> It is my understanding that FIN_WAIT_2 happens on the client side after
> the client does an active close but before the server does a close. If
> the server process died I'd expect the OS to send a FIN back to the
> client, so perhaps you have child processes that never died off from the
> previous generation? On that theory, you would have 177 connections in
> CLOSE_WAIT on the server.

yep, of course you're absolutely right...

I started a document describing these issues at
http://www.apache.org/~trawick/http_tcp.html

There are some windows at the end of the exchange where Linux netstat (client side) and Solaris 8 netstat (server side) did not display information about the connection, even though it still existed. Also, there are possible deviations in the order in which the exchange can occur, which would result in different TCP connection states.

--
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
http://www.geocities.com/SiliconValley/Park/9289/
Born in Roswell... married an alien...
Re: has anybody seen worker segfaults?
On Fri, Feb 22, 2002 at 02:43:04PM -0500, Jeff Trawick wrote:
> > It is my understanding that FIN_WAIT_2 happens on the client side
> > after the client does an active close but before the server does a
> > close. If the server process died I'd expect the OS to send a FIN
> > back to the client, so perhaps you have child processes that never
> > died off from the previous generation? On that theory, you would have
> > 177 connections in CLOSE_WAIT on the server.
>
> yep, of course you're absolutely right...

If you do indeed have connections hanging around on the server stuck in CLOSE_WAIT after a graceful, then we probably have a bug in our graceful restart code. Was this the case, or did they eventually go away?

-aaron
Re: has anybody seen worker segfaults?
Jeff Trawick [EMAIL PROTECTED] writes:

> For some time now (but only after 2.0.32), some tests I run have been
> segfaulting around the time of a graceful restart. Has anybody else
> seen something like this?

A new summary. Here are some failure scenarios found when doing a graceful restart while there were active connections:

a) 2 listening sockets on Linux, where we need an accept mutex:

     [emerg] (43)Identifier removed: apr_proc_mutex_lock failed. Attempting to shutdown process gracefully.
     [emerg] (22)Invalid argument: apr_proc_mutex_unlock failed. Attempting to shutdown process gracefully.

   The server process exits and existing connections are dropped.

b) 1 listening socket on Linux, where we don't need an accept mutex (intermittent failure):

     [notice] child pid 18314 exit signal Segmentation fault (11)

c) 1 listening socket on AIX, where we don't need an accept mutex (intermittent failure):

     [crit] (22)A system call received a parameter that is not valid.: ap_queue_push failed

d) Dale Ghent hit a segfault on Solaris 8 in ap_queue_interrupt_all() (NULL parameter passed in).

I would guess that the cause is the patch to stop using signals (p=0.75) or the patch to reuse transaction pools (p=0.10).

--
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
http://www.geocities.com/SiliconValley/Park/9289/
Born in Roswell... married an alien...
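The "(43)Identifier removed" failure in scenario (a) is errno EIDRM, which is what a blocked semop() returns when the System V semaphore set it is waiting on is removed out from under it. A small stand-alone sketch of the suspected mechanism (not httpd code; it only models a child blocked on the accept-mutex semaphore while a restart removes the semaphore set):

```c
#include <assert.h>
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that blocks on a SysV semaphore, then remove the
 * semaphore set; returns the errno the child's semop() failed with.
 * Expected: EIDRM, which strerror() renders "Identifier removed"
 * (value 43 on Linux, matching the log line above). */
int accept_mutex_removal_demo(void)
{
    int semid = semget(IPC_PRIVATE, 1, 0600);  /* value starts at 0 on Linux */
    if (semid == -1) {
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {                      /* "httpd child" */
        struct sembuf op = { 0, -1, 0 }; /* blocks: the value is 0 */
        int rc = semop(semid, &op, 1);
        _exit(rc == -1 ? errno : 0);     /* report errno via exit code */
    }

    sleep(1);                            /* let the child block in semop() */
    semctl(semid, 0, IPC_RMID);          /* "restart" removes the mutex */

    int status = 0;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

This only demonstrates the errno's meaning; whether the real bug is the parent (or a dying generation) removing the accept mutex too early is exactly what is under investigation in this thread.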
Re: has anybody seen worker segfaults?
Aaron Bannert [EMAIL PROTECTED] writes:

> On Fri, Feb 22, 2002 at 02:43:04PM -0500, Jeff Trawick wrote:
> > > It is my understanding that FIN_WAIT_2 happens on the client side
> > > after the client does an active close but before the server does a
> > > close. If the server process died I'd expect the OS to send a FIN
> > > back to the client, so perhaps you have child processes that never
> > > died off from the previous generation? On that theory, you would
> > > have 177 connections in CLOSE_WAIT on the server.
> >
> > yep, of course you're absolutely right...
>
> If you do indeed have connections hanging around on the server stuck in
> CLOSE_WAIT after a graceful, then we probably have a bug in our graceful
> restart code. Was this the case, or did they eventually go away?

For the moment I'm assuming that it is a bug in the tool I was using.

--
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
http://www.geocities.com/SiliconValley/Park/9289/
Born in Roswell... married an alien...
Re: has anybody seen worker segfaults?
Jeff Trawick [EMAIL PROTECTED] writes:

> b) 1 listening socket on Linux, where we don't need an accept mutex
>    (intermittent failure):
>
>      [notice] child pid 18314 exit signal Segmentation fault (11)
> ...
> d) Dale Ghent hit a segfault on Solaris 8 in ap_queue_interrupt_all()
>    (NULL parameter passed in).

At least some of the segfaults I'm seeing on Linux match the Solaris symptom. The bug is that we're calling ap_queue_interrupt_all() before initializing worker_queue:

t0: we need to fork() a new child for some reason
t1: we get the graceful restart prod on the pod BEFORE the start_threads() thread has gotten dispatched and initialized worker_queue
t2: we call signal_workers(), which tries to use a NULL worker_queue, and we segfault

One possible fix for this is to initialize worker_queue in child_main() before creating the start_threads() thread, so that there is no question that the init has been done before we try to use it.

But there seem to be other timing issues as well. Maybe we miss our chance to apr_thread_join() a worker thread right before it has been created. A fix that would handle that as well as the first problem would be to join the start_threads() thread before trying to process a restart request.

--
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
http://www.geocities.com/SiliconValley/Park/9289/
Born in Roswell... married an alien...
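The ordering fix described above can be sketched with plain pthreads. This is a hypothetical model, not the actual worker MPM code; all the names are stand-ins. The point is simply that the queue is allocated and initialized in the child's main routine before the starter thread even exists, and the starter is joined before any restart signal is acted on, so signal_workers() can never see a NULL queue:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Stand-in for httpd's fd_queue_t. */
typedef struct {
    pthread_mutex_t lock;
    int interrupted;
} fake_queue_t;

static fake_queue_t *worker_queue;  /* NULL until initialized */

/* In the buggy ordering, worker_queue was initialized inside this
 * thread, so a restart prod delivered before it ran saw NULL. */
static void *start_threads(void *arg)
{
    (void)arg;  /* worker threads would be created here */
    return NULL;
}

/* This is what segfaulted when worker_queue was still NULL. */
static void signal_workers(void)
{
    pthread_mutex_lock(&worker_queue->lock);
    worker_queue->interrupted = 1;
    pthread_mutex_unlock(&worker_queue->lock);
}

/* Fixed ordering: init the queue first, then spawn the starter thread,
 * and join the starter before acting on a restart request. */
int child_main_fixed(void)
{
    pthread_t starter;

    worker_queue = calloc(1, sizeof(*worker_queue));
    if (worker_queue == NULL ||
        pthread_mutex_init(&worker_queue->lock, NULL) != 0) {
        return -1;
    }
    if (pthread_create(&starter, NULL, start_threads, NULL) != 0) {
        return -1;
    }
    pthread_join(starter, NULL);  /* closes the second race, too */
    signal_workers();             /* now safe at any time */
    return worker_queue->interrupted;
}
```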
Re: has anybody seen worker segfaults?
Jeff Trawick [EMAIL PROTECTED] writes:

> For some time now (but only after 2.0.32), some tests I run have been
> segfaulting around the time of a graceful restart. Has anybody else
> seen something like this?
>
> [Tue Feb 19 10:31:43 2002] [notice] child pid 5367 exit signal Segmentation fault (11)
> [Tue Feb 19 10:31:43 2002] [notice] SIGUSR1 received. Doing graceful restart
> [Tue Feb 19 10:31:46 2002] [info] mod_unique_id: using ip addr 24.163.40.92
> [Tue Feb 19 10:31:47 2002] [notice] Apache/2.0.33-dev (Unix) DAV/2 configured -- resuming normal operations
> [Tue Feb 19 10:31:47 2002] [info] Server built: Feb 19 2002 10:23:24
> [Tue Feb 19 10:31:47 2002] [notice] child pid 5483 exit signal Segmentation fault (11)
> [Tue Feb 19 10:31:47 2002] [notice] SIGUSR1 received. Doing graceful restart
> [Tue Feb 19 10:31:50 2002] [info] mod_unique_id: using ip addr 24.163.40.92
> [Tue Feb 19 10:31:51 2002] [notice] Apache/2.0.33-dev (Unix) DAV/2 configured -- resuming normal operations
> [Tue Feb 19 10:31:51 2002] [info] Server built: Feb 19 2002 10:23:24
> [Tue Feb 19 10:31:51 2002] [notice] child pid 5579 exit signal Segmentation fault (11)
> [Tue Feb 19 10:31:51 2002] [notice] SIGUSR1 received. Doing graceful restart

I just tried to hit this on my Solaris x86 box with no luck. I did 200,000 simple requests with a SIGUSR1 sent to the parent every 2 seconds. No segfaults or failed pthread calls, but at the end of the 200,000 requests I had 177 stranded connections (client still waiting on a response on 177 connections). I also had 177 connections to the httpd port in FIN_WAIT_2. Weird...

--
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
http://www.geocities.com/SiliconValley/Park/9289/
Born in Roswell... married an alien...
Re: has anybody seen worker segfaults?
On Thu, Feb 21, 2002 at 06:02:26PM -0500, Jeff Trawick wrote:
> I just tried to hit this on my Solaris x86 box with no luck. I did
> 200,000 simple requests with a SIGUSR1 sent to the parent every 2
> seconds. No segfaults or failed pthread calls, but at the end of the
> 200,000 requests I had 177 stranded connections (client still waiting
> on a response on 177 connections). I also had 177 connections to the
> httpd port in FIN_WAIT_2. Weird...

It is my understanding that FIN_WAIT_2 happens on the client side after the client does an active close but before the server does a close. If the server process died I'd expect the OS to send a FIN back to the client, so perhaps you have child processes that never died off from the previous generation? On that theory, you would have 177 connections in CLOSE_WAIT on the server.

-aaron
Re: has anybody seen worker segfaults?
Brian Pane [EMAIL PROTECTED] writes:

> Jeff Trawick wrote:
> > Maybe this is a hint... For a couple of the restart iterations, worker
> > on AIX logs this:
> >
> > [crit] ap_queue_push failed with error code -1
>
> In your AIX test environment, can you catch this error case in action by
> putting breakpoints at the two lines in ap_queue_push() where it's about
> to return -1?
>
> int ap_queue_push(fd_queue_t *queue, apr_socket_t *sd, apr_pool_t *p,
>                   apr_pool_t **recycled_pool)
> {
>     /* ... */
>     if (apr_thread_mutex_lock(queue->one_big_mutex) != APR_SUCCESS) {
>         return FD_QUEUE_FAILURE;

This is returning EINVAL.

> That might help isolate the source of the problem. My two guesses right
> now are:
> - pool lifetime problem, or
> - pthread library problem

Pool lifetime is by far the most likely suspect of the two. Consider that we've been happily obtaining/releasing that mutex all along until restart time, when a process that is dying hits the problem.

--
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
http://www.geocities.com/SiliconValley/Park/9289/
Born in Roswell... married an alien...
Re: has anybody seen worker segfaults?
On Wed, Feb 20, 2002 at 11:44:14AM -0500, Dale Ghent wrote:
> FWIW, I compiled up the latest CVS HEAD as of last night (just after
> the CAS stuff was re-added back into APR) on Solaris 8+sendfile, hit
> the server up pretty hard with ab, fetching 9.5k and 608k jpeg files
> thousands of times with 50 concurrent non-keepalive connections, and
> there were no core files or any errors reported in error_log. This was
> done with the worker MPM.
>
> I'll say 'congratulations!' to that!

Thanks! Now try it again and hit bin/apachectl graceful in the middle of your test [a few times]. :)

-aaron
Re: has anybody seen worker segfaults?
On Wed, 20 Feb 2002, Aaron Bannert wrote:

| Now try it again and hit bin/apachectl graceful in the middle of your
| test [a few times]. :)

Got a core with this. ab reported 159 (out of 2000) requests failed (in the Length: category). Here's a bt:

#0  ap_queue_interrupt_all (queue=0x0) at fdqueue.c:219
219         if (apr_thread_mutex_lock(queue->one_big_mutex) != APR_SUCCESS) {
(gdb) where full
#0  ap_queue_interrupt_all (queue=0x0) at fdqueue.c:219
No locals.
#1  0x7 in child_main (child_num_arg=1) at worker.c:998
    threads = (apr_thread_t **) 0x1bb7a8
    i = 826368
    rv = 1
    ts = (thread_starter *) 0xc9c00
    thread_attr = (apr_threadattr_t *) 0x10bdb0
    start_thread_id = (apr_thread_t *) 0x10bdc0
#2  0x7469c in make_child (s=0x1777b8, slot=1) at worker.c:1071
    pid = 0
#3  0x749a0 in perform_idle_server_maintenance () at worker.c:1233
    i = 1
    j = 0
    idle_thread_count = 23
    ps = (process_score *) 0xfee40060
    free_length = 1
    totally_free_length = 831488
    free_slots = {1, 0, 1, 0, 0, 472476, 32769, 0, 1, -4261240, 1, 831488,
      0, 6, 5, 813056, -4260988, -4260992, -4260984, 867320, 813056, 0,
      -4261096, 477948, 10, 104, -4261096, 473240, 31, 104, -4261064, 6}
    last_non_dead = 5
    total_non_dead = 6
#4  0x74bd0 in server_main_loop (remaining_children_to_start=0) at worker.c:1288
    child_slot = 1
    exitwhy = APR_PROC_EXIT
    status = 0
    processed_status = 0
    pid = {pid = -1, in = 0xffbefc04, out = 0x99280, err = 0x110e70}
    i = 25
#5  0x74e70 in ap_mpm_run (_pconf=0x2, plog=0x113cf8, s=0xc6800) at worker.c:1413
    remaining_children_to_start = 2
    rv = 827392
#6  0x7ad38 in main (argc=1537976, argv=0xd1c70) at main.c:500
    c = 0 '\000'
    configtestonly = 0
    confname = 0xae7f8 "conf/httpd.conf"
    def_server_root = 0xae7e8 "/local/apache2"
    process = (process_rec *) 0xd1c70
    server_conf = (server_rec *) 0x1777b8
    pglobal = (apr_pool_t *) 0xc6800
    pconf = (apr_pool_t *) 0xd3bf8
    plog = (apr_pool_t *) 0x113cf8
    ptemp = (apr_pool_t *) 0x10bcd8
    pcommands = (apr_pool_t *) 0x111cf0
    opt = (apr_getopt_t *) 0x111d88
    rv = 1129720
    mod = (module **) 0x1777b8
    optarg = 0x3 <Address 0x3 out of bounds>
Re: has anybody seen worker segfaults?
On Wed, Feb 20, 2002 at 11:59:15AM -0500, Dale Ghent wrote:
> On Wed, 20 Feb 2002, Aaron Bannert wrote:
> | Now try it again and hit bin/apachectl graceful in the middle of your
> | test [a few times]. :)
>
> Got a core with this. ab reported 159 (out of 2000) requests failed (in
> the Length: category). Here's a bt:

Looks like a pool lifetime problem to me. I'll look into this soonish unless someone beats me to it.

-a
PHP4 was Re: has anybody seen worker segfaults?
On Wed, Feb 20, 2002 at 12:03:00PM -0600, Austin Gonyou wrote:
> PHP4.1.1 or not working?

You have to have the version from CVS in order to get it to compile.

-- justin
Re: PHP4 was Re: has anybody seen worker segfaults?
On Wed, Feb 20, 2002 at 10:16:03AM -0800, Justin Erenkrantz wrote:
> On Wed, Feb 20, 2002 at 12:03:00PM -0600, Austin Gonyou wrote:
> > PHP4.1.1 or not working?
>
> You have to have the version from CVS in order to get it to compile.
> -- justin

Err, just to make it clear, you need the latest version of PHP from their CVS repository. We made changes in the .31 timeframe to our input filters and they haven't done a release since then. DougM committed the relevant fixes to PHP's repository right after we changed it here.

-- justin
Re: PHP4 was Re: has anybody seen worker segfaults?
NP. I am using the CVS version as of last night. That's why I'm writing, and that's why I said 4.1.1 -- I guess it should've been >4.1.1 :)

On Wed, 2002-02-20 at 12:19, Justin Erenkrantz wrote:
> On Wed, Feb 20, 2002 at 10:16:03AM -0800, Justin Erenkrantz wrote:
> > On Wed, Feb 20, 2002 at 12:03:00PM -0600, Austin Gonyou wrote:
> > > PHP4.1.1 or not working?
> >
> > You have to have the version from CVS in order to get it to compile.
> > -- justin
>
> Err, just to make it clear, you need the latest version of PHP from
> their CVS repository. We made changes in the .31 timeframe to our input
> filters and they haven't done a release since then. DougM committed the
> relevant fixes to PHP's repository right after we changed it here.
>
> -- justin

--
Austin Gonyou
Systems Architect, CCNA
Coremetrics, Inc.
Phone: 512-698-7250
email: [EMAIL PROTECTED]

"It is the part of a good shepherd to shear his flock, not to skin it."
  -- Latin Proverb
Re: PHP4 was Re: has anybody seen worker segfaults?
FYI: not a compilation problem. HTTPD just doesn't do anything, but doesn't write a log either, and only 1 process is started.

On Wed, 2002-02-20 at 12:19, Justin Erenkrantz wrote:
> On Wed, Feb 20, 2002 at 10:16:03AM -0800, Justin Erenkrantz wrote:
> > On Wed, Feb 20, 2002 at 12:03:00PM -0600, Austin Gonyou wrote:
> > > PHP4.1.1 or not working?
> >
> > You have to have the version from CVS in order to get it to compile.
> > -- justin
>
> Err, just to make it clear, you need the latest version of PHP from
> their CVS repository. We made changes in the .31 timeframe to our input
> filters and they haven't done a release since then. DougM committed the
> relevant fixes to PHP's repository right after we changed it here.
>
> -- justin

--
Austin Gonyou
Systems Architect, CCNA
Coremetrics, Inc.
Phone: 512-698-7250
email: [EMAIL PROTECTED]

"It is the part of a good shepherd to shear his flock, not to skin it."
  -- Latin Proverb
RE: PHP4 was Re: has anybody seen worker segfaults?
Austin,

I was working on this some time back, and I don't believe it's an Apache problem (it might be a better idea to move it to the PHP mailing list). The problem occurs because of the way the PHP context is handled in php_output_filter (sapi_apache2.c). Here's something that I observed:

The function php_output_filter() is called 2 times because of the way the output data is handled by the filter:
- the first time around, some sort of initialization is done.
- during the second round, the data is sent out.
- then I don't know what happens (possibly wrong termination of the o/p brigade), and php_output_filter is called for the third time.

The filter knows about such a thing happening, but the code is not written properly to handle the situation. The PHP context (ctx) is corrupted, and the PHP module bombs -- you end up seeing only the parent apache process.

(Due to my limited PHP knowledge) I introduced the following snippet just before the ap_save_brigade(...) block, and it seems to bring up apache:

    if ((ctx->state < 0) || (ctx->state > 2)) {
        ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, NULL,
                     "PHP : Unrecognized state!");
        return 0;
    }

This is a workaround and not a resolution/fix. I'd appreciate it if anybody could post a fix for this.

Thanks,
-Madhu

-----Original Message-----
From: Austin Gonyou [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 20, 2002 10:56 AM
To: [EMAIL PROTECTED]
Subject: Re: PHP4 was Re: has anybody seen worker segfaults?

FYI: not a compilation problem. HTTPD just doesn't do anything, but doesn't write a log either, and only 1 process is started.

On Wed, 2002-02-20 at 12:19, Justin Erenkrantz wrote:
> On Wed, Feb 20, 2002 at 10:16:03AM -0800, Justin Erenkrantz wrote:
> > On Wed, Feb 20, 2002 at 12:03:00PM -0600, Austin Gonyou wrote:
> > > PHP4.1.1 or not working?
> >
> > You have to have the version from CVS in order to get it to compile.
> > -- justin
>
> Err, just to make it clear, you need the latest version of PHP from
> their CVS repository. We made changes in the .31 timeframe to our input
> filters and they haven't done a release since then. DougM committed the
> relevant fixes to PHP's repository right after we changed it here.
>
> -- justin

--
Austin Gonyou
Systems Architect, CCNA
Coremetrics, Inc.
Phone: 512-698-7250
email: [EMAIL PROTECTED]

"It is the part of a good shepherd to shear his flock, not to skin it."
  -- Latin Proverb
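The range check Madhu describes can be modeled in isolation (hypothetical names; the real check lives against PHP's SAPI filter context in sapi_apache2.c, and the meaning of the individual state values here is assumed): reject any context whose state field is outside the set of states the filter knows how to handle, rather than proceeding with a corrupted ctx.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the PHP SAPI filter context. */
typedef struct {
    int state;  /* assumed: 0 = init, 1 = sending data, 2 = done */
} php_ctx_t;

/* Returns 1 if ctx is in a state the filter recognizes, 0 otherwise.
 * Mirrors the workaround's condition: bail out when state falls
 * outside the 0..2 range instead of running with a corrupted context. */
int ctx_state_ok(const php_ctx_t *ctx)
{
    return ctx != NULL && ctx->state >= 0 && ctx->state <= 2;
}
```

As Madhu notes, a guard like this only papers over the corruption; the real fix is to terminate the output brigade correctly so the third filter invocation never sees a mangled ctx.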
RE: PHP4 was Re: has anybody seen worker segfaults?
NP. I'll move it over to that list then. I just wanted to bring it up here first since I've had no output from apache at all. strace didn't help much either (not this time). Thanks for the info, Madhu. If I get an actual fix/resolution for this, I'll be sure to let everyone know, so we have closure.

On Wed, 2002-02-20 at 16:13, MATHIHALLI,MADHUSUDAN (HP-Cupertino,ex1) wrote:
> Austin,
>
> I was working on this some time back, and I don't believe it's an
> Apache problem (it might be a better idea to move it to the PHP mailing
> list). The problem occurs because of the way the PHP context is handled
> in php_output_filter (sapi_apache2.c). Here's something that I
> observed:
>
> The function php_output_filter() is called 2 times because of the way
> the output data is handled by the filter:
> - the first time around, some sort of initialization is done.
> - during the second round, the data is sent out.
> - then I don't know what happens (possibly wrong termination of the o/p
>   brigade), and php_output_filter is called for the third time.
>
> The filter knows about such a thing happening, but the code is not
> written properly to handle the situation. The PHP context (ctx) is
> corrupted, and the PHP module bombs -- you end up seeing only the
> parent apache process.
>
> (Due to my limited PHP knowledge) I introduced the following snippet
> just before the ap_save_brigade(...) block, and it seems to bring up
> apache:
>
>     if ((ctx->state < 0) || (ctx->state > 2)) {
>         ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, NULL,
>                      "PHP : Unrecognized state!");
>         return 0;
>     }
>
> This is a workaround and not a resolution/fix. I'd appreciate it if
> anybody could post a fix for this.
>
> Thanks,
> -Madhu

--
Austin Gonyou
Systems Architect, CCNA
Coremetrics, Inc.
Phone: 512-698-7250
email: [EMAIL PROTECTED]

"It is the part of a good shepherd to shear his flock, not to skin it."
  -- Latin Proverb
has anybody seen worker segfaults?
For some time now (but only after 2.0.32), some tests I run have been segfaulting around the time of a graceful restart. Has anybody else seen something like this?

[Tue Feb 19 10:31:43 2002] [notice] child pid 5367 exit signal Segmentation fault (11)
[Tue Feb 19 10:31:43 2002] [notice] SIGUSR1 received. Doing graceful restart
[Tue Feb 19 10:31:46 2002] [info] mod_unique_id: using ip addr 24.163.40.92
[Tue Feb 19 10:31:47 2002] [notice] Apache/2.0.33-dev (Unix) DAV/2 configured -- resuming normal operations
[Tue Feb 19 10:31:47 2002] [info] Server built: Feb 19 2002 10:23:24
[Tue Feb 19 10:31:47 2002] [notice] child pid 5483 exit signal Segmentation fault (11)
[Tue Feb 19 10:31:47 2002] [notice] SIGUSR1 received. Doing graceful restart
[Tue Feb 19 10:31:50 2002] [info] mod_unique_id: using ip addr 24.163.40.92
[Tue Feb 19 10:31:51 2002] [notice] Apache/2.0.33-dev (Unix) DAV/2 configured -- resuming normal operations
[Tue Feb 19 10:31:51 2002] [info] Server built: Feb 19 2002 10:23:24
[Tue Feb 19 10:31:51 2002] [notice] child pid 5579 exit signal Segmentation fault (11)
[Tue Feb 19 10:31:51 2002] [notice] SIGUSR1 received. Doing graceful restart

The test sends USR1 to the parent, then sends in a request (and repeats that cycle 6 or so times). I'm not getting any core dumps from the segfaulting child (threads and Linux :) ). I need to spend more time looking into this, but first I wondered if anybody else saw it.

RH 6.2:    segfaults as (barely) described above
Solaris 8: no segfaults
AIX 4.3.3: no segfaults

(typical: the platforms where I can get a coredump pretty reliably don't have the problem :) )

Maybe this is a hint... For a couple of the restart iterations, worker on AIX logs this:

[crit] ap_queue_push failed with error code -1

Hmmm...

--
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
http://www.geocities.com/SiliconValley/Park/9289/
Born in Roswell... married an alien...
Re: has anybody seen worker segfaults?
On Tue, Feb 19, 2002 at 12:16:13PM -0500, Jeff Trawick wrote:
> I'm not getting any core dumps from the segfaulting child (threads and
> Linux :) ). I need to spend more time looking into this, but first I
> wondered if anybody else saw it.
>
> RH 6.2:    segfaults as (barely) described above
> Solaris 8: no segfaults
> AIX 4.3.3: no segfaults
>
> (typical: the platforms where I can get a coredump pretty reliably
> don't have the problem :) )
>
> Maybe this is a hint... For a couple of the restart iterations, worker
> on AIX logs this:
>
> [crit] ap_queue_push failed with error code -1

This will only happen in ap_queue_push when apr_thread_mutex_lock or apr_thread_mutex_unlock fail (Yes, I do error checking on the pthread lock/unlock cases *grin*). I'm guessing this is a problem with pthread mutexes on whatever version of linux runs on RH6.2?

-aaron
Re: has anybody seen worker segfaults?
Aaron Bannert [EMAIL PROTECTED] writes:

> > Maybe this is a hint... For a couple of the restart iterations,
> > worker on AIX logs this:
> >
> > [crit] ap_queue_push failed with error code -1
>
> This will only happen in ap_queue_push when apr_thread_mutex_lock or
> apr_thread_mutex_unlock fail (Yes, I do error checking on the pthread
> lock/unlock cases *grin*). I'm guessing this is a problem with pthread
> mutexes on whatever version of linux runs on RH6.2?

That log message was seen on AIX, not RH. Since we're guessing, I'll guess that there is a pool lifetime problem (catch-all?) :)

To make some progress on this, I'll get ap_queue_push() to log something meaningful (which APR call failed, and with which error) and see if that sheds any light on it (perhaps EFAULT is the error?).

--
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
http://www.geocities.com/SiliconValley/Park/9289/
Born in Roswell... married an alien...
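The diagnostic Jeff proposes can be sketched as follows. All names here are hypothetical, and the lock/unlock calls are injected function pointers purely so the failure path is easy to exercise; httpd's real push path would use the apr_thread_mutex_* calls and ap_log_error. The point is to record which call failed and with what status, so the log can say more than a bare -1:

```c
#include <assert.h>

typedef enum { PUSH_OK, PUSH_LOCK_FAILED, PUSH_UNLOCK_FAILED } push_result;

/* What a meaningful log line needs: the failing step and its status. */
typedef struct {
    push_result step;
    int status;  /* e.g. 22 for EINVAL */
} push_error;

push_result checked_push(int (*lock)(void), int (*unlock)(void),
                         push_error *err)
{
    int rv;

    if ((rv = lock()) != 0) {
        err->step = PUSH_LOCK_FAILED;
        err->status = rv;
        return PUSH_LOCK_FAILED;
    }
    /* ... enqueue the connection here ... */
    if ((rv = unlock()) != 0) {
        err->step = PUSH_UNLOCK_FAILED;
        err->status = rv;
        return PUSH_UNLOCK_FAILED;
    }
    return PUSH_OK;
}

static int lock_ok(void) { return 0; }
static int unlock_einval(void) { return 22; }  /* simulated EINVAL */

/* Demo: an unlock failure is attributed to the right call with the
 * right status. Returns 1 on the expected outcome. */
int push_diagnostic_demo(void)
{
    push_error e = { PUSH_OK, 0 };
    return checked_push(lock_ok, unlock_einval, &e) == PUSH_UNLOCK_FAILED
        && e.step == PUSH_UNLOCK_FAILED
        && e.status == 22;
}
```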
Re: has anybody seen worker segfaults?
On Tue, Feb 19, 2002 at 12:33:58PM -0500, Jeff Trawick wrote:
> Aaron Bannert [EMAIL PROTECTED] writes:
> > > Maybe this is a hint... For a couple of the restart iterations,
> > > worker on AIX logs this:
> > >
> > > [crit] ap_queue_push failed with error code -1
> >
> > This will only happen in ap_queue_push when apr_thread_mutex_lock or
> > apr_thread_mutex_unlock fail (Yes, I do error checking on the pthread
> > lock/unlock cases *grin*). I'm guessing this is a problem with
> > pthread mutexes on whatever version of linux runs on RH6.2?
>
> That log message was seen on AIX, not RH.

This would not be the first time pthread problems have been seen under linux, though. On RH 7.1 we get segfaults in the middle of the fork call with the prefork mpm under high load. We still haven't been able to figure out why this is happening, but it appears to be a problem with linux pthreads.

Has anyone else been having problems with pthreads under linux?

-adam

--
I believe in Kadath in the cold waste, and Ultima Thule. But you cannot
prove to me that Harvard Law School actually exists.
  - Theodora Goss

"I'm not like that, I have a cat, I don't need you.. My cat, and about 18
lines of bourne shell code replace you in life."
  - anonymous

Adam Sussman
Vidya Media Ventures
[EMAIL PROTECTED]
Re: has anybody seen worker segfaults?
Jeff Trawick wrote:
> Maybe this is a hint... For a couple of the restart iterations, worker
> on AIX logs this:
>
> [crit] ap_queue_push failed with error code -1

In your AIX test environment, can you catch this error case in action by putting breakpoints at the two lines in ap_queue_push() where it's about to return -1?

int ap_queue_push(fd_queue_t *queue, apr_socket_t *sd, apr_pool_t *p,
                  apr_pool_t **recycled_pool)
{
    /* ... */
    if (apr_thread_mutex_lock(queue->one_big_mutex) != APR_SUCCESS) {
        return FD_QUEUE_FAILURE;
    }
    /* ... */
    if (apr_thread_mutex_unlock(queue->one_big_mutex) != APR_SUCCESS) {
        return FD_QUEUE_FAILURE;
    }
    return FD_QUEUE_SUCCESS;
}

That might help isolate the source of the problem. My two guesses right now are:
- pool lifetime problem, or
- pthread library problem

--Brian