Re: has anybody seen worker segfaults?

2002-02-23 Thread Jeff Trawick

Jeff Trawick [EMAIL PROTECTED] writes:

  t0   we need to fork() a new child for some reason
  t1   we get the graceful restart prod on the pod BEFORE
   the start_threads() thread has gotten dispatched and
   initialized worker_queue
  t2   we call signal_workers() which tries to use a NULL
   worker_queue and we segfault
 
 One possible fix for this is to initialize worker_queue in
 child_main() before creating the start_threads() thread so that there
 is no question that the init has been done before we try to use it.
 
 But there seem to be other timing issues as well.  Maybe we miss our
 chance to apr_thread_join() a worker thread right before it has been
 created.  A fix that would handle that as well as the first problem
 would be to join the start_threads() thread before trying to process a
 restart request.

After trying this, the segfaults go away, but we still have some processes
bailing out prematurely (still get the accept mutex error, still get a
lot of dropped connections even though nobody is segfaulting).

-- 
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
   http://www.geocities.com/SiliconValley/Park/9289/
 Born in Roswell... married an alien...



Re: has anybody seen worker segfaults?

2002-02-22 Thread Jeff Trawick

Aaron Bannert [EMAIL PROTECTED] writes:

 On Thu, Feb 21, 2002 at 06:02:26PM -0500, Jeff Trawick wrote:
  I just tried to hit this on my Solaris x86 box with no luck.  I did
  200,000 simple requests with a SIGUSR1 sent to the parent every 2
  seconds.  No segfaults or failed pthread calls, but at the end of the
  200,000 requests I had 177 stranded connections (client still waiting
  on a response on 177 connections).  I also had 177 connections to the
  httpd port in FIN_WAIT_2.  Weird...
 
 It is my understanding that FIN_WAIT_2 happens on the client side after
 the client does an active close but before the server does a close. If
 the server process died I'd expect the OS to send a FIN back to the
 client, so perhaps you have child processes that never died off from the
 previous generation?  On that theory, you would have 177 connections in
 CLOSE_WAIT on the server.

yep, of course you're absolutely right...

I started a document describing these issues at

  http://www.apache.org/~trawick/http_tcp.html

There are some windows at the end of the exchange where Linux netstat
(client side) and Solaris 8 netstat (server side) did not display
information about the connection, even though it still existed.

Also there are possible deviations in the order in which the exchange
can occur which would result in different TCP connection states.

-- 
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
   http://www.geocities.com/SiliconValley/Park/9289/
 Born in Roswell... married an alien...



Re: has anybody seen worker segfaults?

2002-02-22 Thread Aaron Bannert

On Fri, Feb 22, 2002 at 02:43:04PM -0500, Jeff Trawick wrote:
  It is my understanding that FIN_WAIT_2 happens on the client side after
  the client does an active close but before the server does a close. If
  the server process died I'd expect the OS to send a FIN back to the
  client, so perhaps you have child processes that never died off from the
  previous generation?  On that theory, you would have 177 connections in
  CLOSE_WAIT on the server.
 
 yep, of course you're absolutely right...

If you do indeed have connections hanging around on the server stuck
in CLOSE_WAIT after a graceful, then we probably have a bug in our
graceful restart code. Was this the case, or did they eventually go
away?

-aaron



Re: has anybody seen worker segfaults?

2002-02-22 Thread Jeff Trawick

Jeff Trawick [EMAIL PROTECTED] writes:

 For some time (but after 2.0.32), some tests I run have been
 segfaulting around the time of a graceful restart.  Has anybody else
 seen something like this?

a new summary:

Here are some failure scenarios found when doing a graceful restart
while there were active connections:

a) 2 listening sockets on Linux, where we need an accept mutex:

 [emerg] (43)Identifier removed: apr_proc_mutex_lock failed. 
 Attempting to shutdown process gracefully.
 [emerg] (22)Invalid argument: apr_proc_mutex_unlock failed. 
 Attempting to shutdown process gracefully.

   The server process exits and existing connections are dropped.

b) 1 listening socket in Linux, where we don't need an accept
   mutex (intermittent failure):

   [notice] child pid 18314 exit signal Segmentation fault (11)

c) 1 listening socket on AIX, where we don't need an accept
   mutex (intermittent failure):

 [crit] (22)A system call received a parameter that is not
 valid.: ap_queue_push failed

d) Dale Ghent hit a segfault on Solaris 8 in ap_queue_interrupt_all()
   (NULL parameter passed in).

I would guess that the cause is the patch to stop using signals
(p=0.75) or the patch to reuse transaction pools (p=0.10).

-- 
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
   http://www.geocities.com/SiliconValley/Park/9289/
 Born in Roswell... married an alien...



Re: has anybody seen worker segfaults?

2002-02-22 Thread Jeff Trawick

Aaron Bannert [EMAIL PROTECTED] writes:

 On Fri, Feb 22, 2002 at 02:43:04PM -0500, Jeff Trawick wrote:
   It is my understanding that FIN_WAIT_2 happens on the client side after
   the client does an active close but before the server does a close. If
   the server process died I'd expect the OS to send a FIN back to the
   client, so perhaps you have child processes that never died off from the
   previous generation?  On that theory, you would have 177 connections in
   CLOSE_WAIT on the server.
  
  yep, of course you're absolutely right...
 
 If you do indeed have connections hanging around on the server stuck
 in CLOSE_WAIT after a graceful, then we probably have a bug in our
 graceful restart code. Was this the case, or did they eventually go
 away?

For the moment I'm assuming that it is a bug in the tool I was using.

-- 
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
   http://www.geocities.com/SiliconValley/Park/9289/
 Born in Roswell... married an alien...



Re: has anybody seen worker segfaults?

2002-02-22 Thread Jeff Trawick

Jeff Trawick [EMAIL PROTECTED] writes:

 b) 1 listening socket in Linux, where we don't need an accept
    mutex (intermittent failure):
 
    [notice] child pid 18314 exit signal Segmentation fault (11)
 ...
 d) Dale Ghent hit a segfault on Solaris 8 in ap_queue_interrupt_all()
    (NULL parameter passed in).

At least some segfaults I'm seeing on Linux match the Solaris
symptom.  The bug is that we're calling ap_queue_interrupt_all()
before initializing worker_queue.

 t0   we need to fork() a new child for some reason
 t1   we get the graceful restart prod on the pod BEFORE
  the start_threads() thread has gotten dispatched and
  initialized worker_queue
 t2   we call signal_workers() which tries to use a NULL
  worker_queue and we segfault

One possible fix for this is to initialize worker_queue in
child_main() before creating the start_threads() thread so that there
is no question that the init has been done before we try to use it.

But there seem to be other timing issues as well.  Maybe we miss our
chance to apr_thread_join() a worker thread right before it has been
created.  A fix that would handle that as well as the first problem
would be to join the start_threads() thread before trying to process a
restart request.
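
The first fix proposed above (initialize the queue before anything can
signal the workers) can be modeled without real threads. This is a minimal
single-threaded sketch, not the worker MPM code: model_queue_t,
model_child_init(), model_interrupt_all() and model_signal_workers() are
hypothetical names invented for illustration, and the NULL check stands in
for the crash that the real ap_queue_interrupt_all() hits when it is handed
a NULL queue.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the worker MPM's fd_queue_t. */
typedef struct {
    int interrupted;
} model_queue_t;

/* Stand-in for the file-scope worker_queue pointer in worker.c. */
static model_queue_t *worker_queue = NULL;

/* Models ap_queue_interrupt_all(). The real function dereferences its
 * argument, so a NULL queue crashes; this model reports -1 instead. */
static int model_interrupt_all(model_queue_t *queue)
{
    if (queue == NULL) {
        return -1;               /* the real code would segfault here */
    }
    queue->interrupted = 1;
    return 0;
}

/* Proposed ordering: allocate the queue early in child_main(), before
 * the start_threads() thread is created (thread creation not modeled). */
static int model_child_init(void)
{
    worker_queue = calloc(1, sizeof(*worker_queue));
    return worker_queue != NULL ? 0 : -1;
}

/* Models the restart prod arriving on the pod. */
static int model_signal_workers(void)
{
    return model_interrupt_all(worker_queue);
}
```

Calling model_signal_workers() before model_child_init() reproduces the
t0/t1/t2 window; calling it afterwards is safe no matter when the starter
thread gets dispatched.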

-- 
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
   http://www.geocities.com/SiliconValley/Park/9289/
 Born in Roswell... married an alien...



Re: has anybody seen worker segfaults?

2002-02-21 Thread Jeff Trawick

Jeff Trawick [EMAIL PROTECTED] writes:

 For some time (but after 2.0.32), some tests I run have been
 segfaulting around the time of a graceful restart.  Has anybody else
 seen something like this?
 
 [Tue Feb 19 10:31:43 2002] [notice] child pid 5367 exit signal Segmentation fault (11)
 [Tue Feb 19 10:31:43 2002] [notice] SIGUSR1 received.  Doing graceful restart
 [Tue Feb 19 10:31:46 2002] [info] mod_unique_id: using ip addr 24.163.40.92
 [Tue Feb 19 10:31:47 2002] [notice] Apache/2.0.33-dev (Unix) DAV/2 configured -- resuming normal operations
 [Tue Feb 19 10:31:47 2002] [info] Server built: Feb 19 2002 10:23:24
 [Tue Feb 19 10:31:47 2002] [notice] child pid 5483 exit signal Segmentation fault (11)
 [Tue Feb 19 10:31:47 2002] [notice] SIGUSR1 received.  Doing graceful restart
 [Tue Feb 19 10:31:50 2002] [info] mod_unique_id: using ip addr 24.163.40.92
 [Tue Feb 19 10:31:51 2002] [notice] Apache/2.0.33-dev (Unix) DAV/2 configured -- resuming normal operations
 [Tue Feb 19 10:31:51 2002] [info] Server built: Feb 19 2002 10:23:24
 [Tue Feb 19 10:31:51 2002] [notice] child pid 5579 exit signal Segmentation fault (11)
 [Tue Feb 19 10:31:51 2002] [notice] SIGUSR1 received.  Doing graceful restart

I just tried to hit this on my Solaris x86 box with no luck.  I did
200,000 simple requests with a SIGUSR1 sent to the parent every 2
seconds.  No segfaults or failed pthread calls, but at the end of the
200,000 requests I had 177 stranded connections (client still waiting
on a response on 177 connections).  I also had 177 connections to the
httpd port in FIN_WAIT_2.  Weird...

-- 
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
   http://www.geocities.com/SiliconValley/Park/9289/
 Born in Roswell... married an alien...



Re: has anybody seen worker segfaults?

2002-02-21 Thread Aaron Bannert

On Thu, Feb 21, 2002 at 06:02:26PM -0500, Jeff Trawick wrote:
 I just tried to hit this on my Solaris x86 box with no luck.  I did
 200,000 simple requests with a SIGUSR1 sent to the parent every 2
 seconds.  No segfaults or failed pthread calls, but at the end of the
 200,000 requests I had 177 stranded connections (client still waiting
 on a response on 177 connections).  I also had 177 connections to the
 httpd port in FIN_WAIT_2.  Weird...

It is my understanding that FIN_WAIT_2 happens on the client side after
the client does an active close but before the server does a close. If
the server process died I'd expect the OS to send a FIN back to the
client, so perhaps you have child processes that never died off from the
previous generation?  On that theory, you would have 177 connections in
CLOSE_WAIT on the server.
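
The close sequence described above can be written down as a small state
table. This is a simplified sketch of the standard TCP teardown states
(simultaneous close and timers are left out); tcp_next() and the ST_/EV_
names are made up for this example and are not part of any real API.

```c
/* Connection states relevant to teardown. */
typedef enum {
    ST_ESTABLISHED,
    ST_FIN_WAIT_1,   /* active closer: FIN sent, not yet ACKed */
    ST_FIN_WAIT_2,   /* active closer: FIN ACKed, waiting for peer's FIN */
    ST_CLOSE_WAIT,   /* passive closer: peer's FIN seen, app has not closed */
    ST_LAST_ACK,     /* passive closer: own FIN sent, waiting for ACK */
    ST_TIME_WAIT,
    ST_CLOSED
} tcp_state;

typedef enum {
    EV_APP_CLOSE,        /* local application calls close() */
    EV_RECV_FIN,         /* peer's FIN arrives */
    EV_RECV_ACK_OF_FIN   /* peer ACKs our FIN */
} tcp_event;

/* Return the next state; unknown combinations leave the state unchanged. */
static tcp_state tcp_next(tcp_state s, tcp_event ev)
{
    switch (s) {
    case ST_ESTABLISHED:
        if (ev == EV_APP_CLOSE) return ST_FIN_WAIT_1;
        if (ev == EV_RECV_FIN)  return ST_CLOSE_WAIT;
        break;
    case ST_FIN_WAIT_1:
        if (ev == EV_RECV_ACK_OF_FIN) return ST_FIN_WAIT_2;
        break;
    case ST_FIN_WAIT_2:
        if (ev == EV_RECV_FIN) return ST_TIME_WAIT;
        break;
    case ST_CLOSE_WAIT:
        if (ev == EV_APP_CLOSE) return ST_LAST_ACK;
        break;
    case ST_LAST_ACK:
        if (ev == EV_RECV_ACK_OF_FIN) return ST_CLOSED;
        break;
    default:
        break;
    }
    return s;
}
```

In this model a client that closes first sits in ST_FIN_WAIT_2 for exactly
as long as the server stays in ST_CLOSE_WAIT, that is, until the server
side application closes its end of the socket.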

-aaron



Re: has anybody seen worker segfaults?

2002-02-20 Thread Jeff Trawick

Brian Pane [EMAIL PROTECTED] writes:

 Jeff Trawick wrote:
 
 Maybe this is a hint...  For a couple of the restart iterations,
 worker on AIX logs this:
 
 [crit] ap_queue_push failed with error code -1
 
 
 In your AIX test environment, can you catch this error
 case in action by putting breakpoints at the two lines
 in ap_queue_push() where it's about to return -1?
 
 int ap_queue_push(fd_queue_t *queue, apr_socket_t *sd, apr_pool_t *p,
                   apr_pool_t **recycled_pool)
 {
     /*...*/
     if (apr_thread_mutex_lock(queue->one_big_mutex) != APR_SUCCESS) {
         return FD_QUEUE_FAILURE;

This is returning EINVAL.

 That might help isolate the source of the problem.  My two
 guesses right now are:
- pool lifetime problem, or
- pthread library problem

Pool lifetime is by far the most likely suspect out of these two.

Consider that we've been happily obtaining/releasing that mutex all
along until restart time, when a process that is dying hits that
problem.
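
If the pool lifetime suspicion is right, the failure would look roughly
like the following. This is a toy model only: sketch_pool_t and
sketch_mutex_t are invented for illustration and are far simpler than APR
pools and mutexes. The mutex storage is deliberately kept outside the pool
so the example never touches freed memory; only the "destroyed" state is
modeled.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical mutex whose validity is tied to a pool's lifetime. */
typedef struct {
    int valid;    /* cleared when the owning pool is destroyed */
    int locked;
} sketch_mutex_t;

/* Hypothetical one-slot pool: it remembers a single mutex to clean up. */
typedef struct {
    sketch_mutex_t *owned;
} sketch_pool_t;

static void sketch_mutex_create(sketch_mutex_t *m, sketch_pool_t *pool)
{
    m->valid = 1;
    m->locked = 0;
    pool->owned = m;              /* register the cleanup */
}

/* Destroying the pool runs the cleanup and invalidates the mutex. */
static void sketch_pool_destroy(sketch_pool_t *pool)
{
    if (pool->owned != NULL) {
        pool->owned->valid = 0;
        pool->owned = NULL;
    }
}

static int sketch_mutex_lock(sketch_mutex_t *m)
{
    if (!m->valid) {
        return EINVAL;            /* mutex already cleaned up */
    }
    m->locked = 1;
    return 0;
}

static int sketch_mutex_unlock(sketch_mutex_t *m)
{
    if (!m->valid) {
        return EINVAL;
    }
    m->locked = 0;
    return 0;
}
```

Lock and unlock succeed for the whole life of the pool and start returning
EINVAL only once the pool has been torn down, which matches "fine all along
until restart time".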

-- 
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
   http://www.geocities.com/SiliconValley/Park/9289/
 Born in Roswell... married an alien...



Re: has anybody seen worker segfaults?

2002-02-20 Thread Aaron Bannert

On Wed, Feb 20, 2002 at 11:44:14AM -0500, Dale Ghent wrote:
 
 FWIW, I compiled up the latest CVS HEAD as of last night (just after the
 CAS stuff was re-added back into APR) on Solaris 8+sendfile, hit the
 server up pretty hard with ab, fetching 9.5k and 608k jpeg files thousands
 of times with 50 concurrent non-keepalive connections, and there were no
 core files or any errors reported in error_log.
 
 This was done with the worker mpm.
 
 I'll say 'congratulations!' to that!

Thanks!

Now try it again and hit bin/apachectl graceful in the middle of your
test [a few times]. :)

-aaron



Re: has anybody seen worker segfaults?

2002-02-20 Thread Dale Ghent

On Wed, 20 Feb 2002, Aaron Bannert wrote:

| Now try it again and hit bin/apachectl graceful in the middle of your
| test [a few times]. :)

Got a core with this. ab reported 159 (out of 2000) requests failed (in
the Length: category). Here's a bt:

#0  ap_queue_interrupt_all (queue=0x0) at fdqueue.c:219
219         if (apr_thread_mutex_lock(queue->one_big_mutex) != APR_SUCCESS) {
(gdb) where full

#0  ap_queue_interrupt_all (queue=0x0) at fdqueue.c:219
No locals.

#1  0x7 in child_main (child_num_arg=1) at worker.c:998
threads = (apr_thread_t **) 0x1bb7a8
i = 826368
rv = 1
ts = (thread_starter *) 0xc9c00
thread_attr = (apr_threadattr_t *) 0x10bdb0
start_thread_id = (apr_thread_t *) 0x10bdc0

#2  0x7469c in make_child (s=0x1777b8, slot=1) at worker.c:1071
pid = 0

#3  0x749a0 in perform_idle_server_maintenance () at worker.c:1233
i = 1
j = 0
idle_thread_count = 23
ps = (process_score *) 0xfee40060
free_length = 1
totally_free_length = 831488
free_slots = {1, 0, 1, 0, 0, 472476, 32769, 0, 1, -4261240, 1,
831488, 0, 6, 5, 813056, -4260988, -4260992, -4260984, 867320, 813056, 0,
-4261096, 477948, 10, 104, -4261096, 473240, 31, 104, -4261064, 6}
last_non_dead = 5
total_non_dead = 6

#4  0x74bd0 in server_main_loop (remaining_children_to_start=0)
at worker.c:1288
child_slot = 1
exitwhy = APR_PROC_EXIT
status = 0
processed_status = 0
pid = {pid = -1, in = 0xffbefc04, out = 0x99280, err = 0x110e70}
i = 25

#5  0x74e70 in ap_mpm_run (_pconf=0x2, plog=0x113cf8, s=0xc6800)
at worker.c:1413
remaining_children_to_start = 2
rv = 827392

#6  0x7ad38 in main (argc=1537976, argv=0xd1c70) at main.c:500
c = 0 '\000'
configtestonly = 0
confname = 0xae7f8 "conf/httpd.conf"
def_server_root = 0xae7e8 "/local/apache2"
process = (process_rec *) 0xd1c70
server_conf = (server_rec *) 0x1777b8
pglobal = (apr_pool_t *) 0xc6800
pconf = (apr_pool_t *) 0xd3bf8
plog = (apr_pool_t *) 0x113cf8
ptemp = (apr_pool_t *) 0x10bcd8
pcommands = (apr_pool_t *) 0x111cf0
opt = (apr_getopt_t *) 0x111d88
rv = 1129720
mod = (module **) 0x1777b8
optarg = 0x3 <Address 0x3 out of bounds>





Re: has anybody seen worker segfaults?

2002-02-20 Thread Aaron Bannert

On Wed, Feb 20, 2002 at 11:59:15AM -0500, Dale Ghent wrote:
 On Wed, 20 Feb 2002, Aaron Bannert wrote:
 
 | Now try it again and hit bin/apachectl graceful in the middle of your
 | test [a few times]. :)
 
 Got a core with this. ab reported 159 (out of 2000) requests failed (in
 the Length: category). Here's a bt:

Looks like a pool lifetime problem to me. I'll look into this soonish
unless someone beats me to it.

-a



PHP4 was Re: has anybody seen worker segfaults?

2002-02-20 Thread Justin Erenkrantz

On Wed, Feb 20, 2002 at 12:03:00PM -0600, Austin Gonyou wrote:
 PHP4.1.1 or  not working?

You have to have the version from CVS in order to get it to
compile.  -- justin




Re: PHP4 was Re: has anybody seen worker segfaults?

2002-02-20 Thread Justin Erenkrantz

On Wed, Feb 20, 2002 at 10:16:03AM -0800, Justin Erenkrantz wrote:
 On Wed, Feb 20, 2002 at 12:03:00PM -0600, Austin Gonyou wrote:
  PHP4.1.1 or  not working?
 
 You have to have the version from CVS in order to get it to
 compile.  -- justin

Err, just to make it clear, you need the latest version of PHP
from their CVS repository.  We made changes in the .31 timeframe
to our input filters and they haven't done a release since then.
DougM committed the relevant fixes to PHP's repository right after
we changed it here.  -- justin




Re: PHP4 was Re: has anybody seen worker segfaults?

2002-02-20 Thread Austin Gonyou

NP. I am using the CVS as of last night. That's why I'm writing, and
that's why I said 4.1.1, I guess it should've been 4.1.1 :) 

On Wed, 2002-02-20 at 12:19, Justin Erenkrantz wrote:
 On Wed, Feb 20, 2002 at 10:16:03AM -0800, Justin Erenkrantz wrote:
  On Wed, Feb 20, 2002 at 12:03:00PM -0600, Austin Gonyou wrote:
   PHP4.1.1 or  not working?
  
  You have to have the version from CVS in order to get it to
  compile.  -- justin
 
 Err, just to make it clear, you need the latest version of PHP
 from their CVS repository.  We made changes in the .31 timeframe
 to our input filters and they haven't done a release since then.
 DougM committed the relevant fixes to PHP's repository right after
 we changed it here.  -- justin
-- 
Austin Gonyou
Systems Architect, CCNA
Coremetrics, Inc.
Phone: 512-698-7250
email: [EMAIL PROTECTED]

It is the part of a good shepherd to shear his flock, not to skin it.
Latin Proverb



Re: PHP4 was Re: has anybody seen worker segfaults?

2002-02-20 Thread Austin Gonyou

FYI. Not a compilation problem, HTTPD just doesn't do anything, but
doesn't write a log either, and only 1 process is started. 

On Wed, 2002-02-20 at 12:19, Justin Erenkrantz wrote:
 On Wed, Feb 20, 2002 at 10:16:03AM -0800, Justin Erenkrantz wrote:
  On Wed, Feb 20, 2002 at 12:03:00PM -0600, Austin Gonyou wrote:
   PHP4.1.1 or  not working?
  
  You have to have the version from CVS in order to get it to
  compile.  -- justin
 
 Err, just to make it clear, you need the latest version of PHP
 from their CVS repository.  We made changes in the .31 timeframe
 to our input filters and they haven't done a release since then.
 DougM committed the relevant fixes to PHP's repository right after
 we changed it here.  -- justin
-- 
Austin Gonyou
Systems Architect, CCNA
Coremetrics, Inc.
Phone: 512-698-7250
email: [EMAIL PROTECTED]

It is the part of a good shepherd to shear his flock, not to skin it.
Latin Proverb



RE: PHP4 was Re: has anybody seen worker segfaults?

2002-02-20 Thread MATHIHALLI,MADHUSUDAN (HP-Cupertino,ex1)

Austin,
I was working on this some time back, and I don't believe it's an
Apache problem (it might be a better idea to move it to the PHP mailing
list).

The problem occurs because of the way the PHP context is handled in
php_output_filter (sapi_apache2.c). Here's what I observed:
The function php_output_filter() is called > 2 times because of the way the
output data is handled by the filter.
- the first time around, some sort of initialization is done.
- during the second round, the data is sent out.
- then I don't know what happens (possibly wrong termination of the o/p
brigade), and php_output_filter is called for the third time. The
filter knows that this can happen, but the code is not written
properly to handle the situation. The PHP context (ctx) is corrupted, and
the PHP module bombs - you end up seeing only the parent apache process.

(Due to my limited PHP knowledge) I introduced the following snippet just
before the ap_save_brigade(...) block, and it seems to bring up apache:

    if ((ctx->state < 0) || (ctx->state > 2)) {
        ap_log_error(APLOG_MARK, APLOG_DEBUG,
                     0, NULL, "PHP : Unrecognized state!");
        return 0;
    }

This is a workaround and not a resolution/fix. I'd appreciate it if anybody
could post a fix for this.

Thanks,
-Madhu




-----Original Message-----
From: Austin Gonyou [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 20, 2002 10:56 AM
To: [EMAIL PROTECTED]
Subject: Re: PHP4 was Re: has anybody seen worker segfaults?


FYI. Not a compilation problem, HTTPD just doesn't do anything, but
doesn't write a log either, and only 1 process is started. 

On Wed, 2002-02-20 at 12:19, Justin Erenkrantz wrote:
 On Wed, Feb 20, 2002 at 10:16:03AM -0800, Justin Erenkrantz wrote:
  On Wed, Feb 20, 2002 at 12:03:00PM -0600, Austin Gonyou wrote:
   PHP4.1.1 or  not working?
  
  You have to have the version from CVS in order to get it to
  compile.  -- justin
 
 Err, just to make it clear, you need the latest version of PHP
 from their CVS repository.  We made changes in the .31 timeframe
 to our input filters and they haven't done a release since then.
 DougM committed the relevant fixes to PHP's repository right after
 we changed it here.  -- justin
-- 
Austin Gonyou
Systems Architect, CCNA
Coremetrics, Inc.
Phone: 512-698-7250
email: [EMAIL PROTECTED]

It is the part of a good shepherd to shear his flock, not to skin it.
Latin Proverb



RE: PHP4 was Re: has anybody seen worker segfaults?

2002-02-20 Thread Austin Gonyou

NP. I'll move it over to that list then. I just wanted to bring it up
here first since I've no output from apache at all. Strace didn't help
much either (not this time).

Thanks for the info Madhu. If I get an actual fix/resolution for this,
I'll be sure to let everyone know, so we have closure.


On Wed, 2002-02-20 at 16:13, MATHIHALLI,MADHUSUDAN (HP-Cupertino,ex1)
wrote:
 Austin,
   I was working on this some time back, and I don't believe it's an
 Apache problem (it might be a better idea to move it to the PHP mailing
 list).
 
 The problem occurs because of the way the PHP context is handled in
 php_output_filter (sapi_apache2.c). Here's what I observed:
 The function php_output_filter() is called > 2 times because of the way
 the output data is handled by the filter.
 - the first time around, some sort of initialization is done.
 - during the second round, the data is sent out.
 - then I don't know what happens (possibly wrong termination of the o/p
 brigade), and php_output_filter is called for the third time. The
 filter knows that this can happen, but the code is not written
 properly to handle the situation. The PHP context (ctx) is corrupted,
 and the PHP module bombs - you end up seeing only the parent apache
 process.
 
 (Due to my limited PHP knowledge) I introduced the following snippet
 just before the ap_save_brigade(...) block, and it seems to bring up
 apache:
 
     if ((ctx->state < 0) || (ctx->state > 2)) {
         ap_log_error(APLOG_MARK, APLOG_DEBUG,
                      0, NULL, "PHP : Unrecognized state!");
         return 0;
     }
 
 This is a workaround and not a resolution/fix. I'd appreciate it if
 anybody could post a fix for this.
 
 Thanks,
 -Madhu
 
 
 
 
 -----Original Message-----
 From: Austin Gonyou [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, February 20, 2002 10:56 AM
 To: [EMAIL PROTECTED]
 Subject: Re: PHP4 was Re: has anybody seen worker segfaults?
 
 
 FYI. Not a compilation problem, HTTPD just doesn't do anything, but
 doesn't write a log either, and only 1 process is started. 
 
 On Wed, 2002-02-20 at 12:19, Justin Erenkrantz wrote:
  On Wed, Feb 20, 2002 at 10:16:03AM -0800, Justin Erenkrantz wrote:
   On Wed, Feb 20, 2002 at 12:03:00PM -0600, Austin Gonyou wrote:
PHP4.1.1 or  not working?
   
   You have to have the version from CVS in order to get it to
   compile.  -- justin
  
  Err, just to make it clear, you need the latest version of PHP
  from their CVS repository.  We made changes in the .31 timeframe
  to our input filters and they haven't done a release since then.
  DougM committed the relevant fixes to PHP's repository right after
  we changed it here.  -- justin
-- 
Austin Gonyou
Systems Architect, CCNA
Coremetrics, Inc.
Phone: 512-698-7250
email: [EMAIL PROTECTED]

It is the part of a good shepherd to shear his flock, not to skin it.
Latin Proverb



has anybody seen worker segfaults?

2002-02-19 Thread Jeff Trawick

For some time (but after 2.0.32), some tests I run have been
segfaulting around the time of a graceful restart.  Has anybody else
seen something like this?

[Tue Feb 19 10:31:43 2002] [notice] child pid 5367 exit signal Segmentation fault (11)
[Tue Feb 19 10:31:43 2002] [notice] SIGUSR1 received.  Doing graceful restart
[Tue Feb 19 10:31:46 2002] [info] mod_unique_id: using ip addr 24.163.40.92
[Tue Feb 19 10:31:47 2002] [notice] Apache/2.0.33-dev (Unix) DAV/2 configured -- resuming normal operations
[Tue Feb 19 10:31:47 2002] [info] Server built: Feb 19 2002 10:23:24
[Tue Feb 19 10:31:47 2002] [notice] child pid 5483 exit signal Segmentation fault (11)
[Tue Feb 19 10:31:47 2002] [notice] SIGUSR1 received.  Doing graceful restart
[Tue Feb 19 10:31:50 2002] [info] mod_unique_id: using ip addr 24.163.40.92
[Tue Feb 19 10:31:51 2002] [notice] Apache/2.0.33-dev (Unix) DAV/2 configured -- resuming normal operations
[Tue Feb 19 10:31:51 2002] [info] Server built: Feb 19 2002 10:23:24
[Tue Feb 19 10:31:51 2002] [notice] child pid 5579 exit signal Segmentation fault (11)
[Tue Feb 19 10:31:51 2002] [notice] SIGUSR1 received.  Doing graceful restart

The test sends USR1 to the parent then sends in a request (and repeats
that cycle 6 or so times).

I'm not getting any core dumps from the segfaulting child (threads and
Linux :) ).  I need to spend more time looking into this, but first I
wondered if anybody else saw it.

RH 6.2: segfaults as (barely) described above
Solaris 8: no segfaults
AIX 4.3.3: no segfaults
(typical: the platforms where I can get a coredump pretty reliably
don't have the problem :) )

Maybe this is a hint...  For a couple of the restart iterations,
worker on AIX logs this:

[crit] ap_queue_push failed with error code -1

Hmmm...
-- 
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
   http://www.geocities.com/SiliconValley/Park/9289/
 Born in Roswell... married an alien...



Re: has anybody seen worker segfaults?

2002-02-19 Thread Aaron Bannert

On Tue, Feb 19, 2002 at 12:16:13PM -0500, Jeff Trawick wrote:
 I'm not getting any core dumps from the segfaulting child (threads and
 Linux :) ).  I need to spend more time looking into this, but first I
 wondered if anybody else saw it.
 
 RH 6.2: segfaults as (barely) described above
 Solaris 8: no segfaults
 AIX 4.3.3: no segfaults
 (typical: the platforms where I can get a coredump pretty reliably
 don't have the problem :) )
 
 Maybe this is a hint...  For a couple of the restart iterations,
 worker on AIX logs this:
 
 [crit] ap_queue_push failed with error code -1

This will only happen in ap_queue_push when apr_thread_mutex_lock or
apr_thread_mutex_unlock fails (Yes, I do error checking on the
pthread lock/unlock cases *grin*).

I'm guessing this is a problem with pthread mutexes on whatever version
of linux runs on RH6.2?

-aaron



Re: has anybody seen worker segfaults?

2002-02-19 Thread Jeff Trawick

Aaron Bannert [EMAIL PROTECTED] writes:

  Maybe this is a hint...  For a couple of the restart iterations,
  worker on AIX logs this:
  
  [crit] ap_queue_push failed with error code -1
 
 This will only happen in ap_queue_push when apr_thread_mutex_lock or
 apr_thread_mutex_unlock fails (Yes, I do error checking on the
 pthread lock/unlock cases *grin*).
 
 I'm guessing this is a problem with pthread mutexes on whatever version
 of linux runs on RH6.2?

That log message was seen on AIX, not RH.

Since we're guessing, I'll guess that there is a pool lifetime
problem (catch-all?) :)

To make some progress on this I'll get ap_queue_push() to log
something meaningful (which APR call failed, which error) and see if
that sheds any light on it (perhaps EFAULT is the error?).
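
The kind of reporting described above might look like the following. This
is only a sketch under stated assumptions: toy_mutex_t, toy_queue_t and
toy_queue_push() are hypothetical stand-ins, not the fdqueue.c code or APR
types. The point is simply to hand back the underlying error code and the
name of the failing call instead of a bare -1.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical mutex: lock/unlock fail with EINVAL once it is unusable. */
typedef struct {
    int usable;
} toy_mutex_t;

static int toy_lock(toy_mutex_t *m)   { return m->usable ? 0 : EINVAL; }
static int toy_unlock(toy_mutex_t *m) { return m->usable ? 0 : EINVAL; }

/* Hypothetical queue with a single big mutex and an element count. */
typedef struct {
    toy_mutex_t one_big_mutex;
    int nelts;
} toy_queue_t;

/* Returns 0 on success, otherwise the error code of the call that failed.
 * On failure *failed_call names that call so the caller can log it. */
static int toy_queue_push(toy_queue_t *queue, const char **failed_call)
{
    int rv;

    *failed_call = NULL;

    rv = toy_lock(&queue->one_big_mutex);
    if (rv != 0) {
        *failed_call = "lock";
        return rv;
    }

    queue->nelts++;               /* the actual enqueue is elided */

    rv = toy_unlock(&queue->one_big_mutex);
    if (rv != 0) {
        *failed_call = "unlock";
        return rv;
    }
    return 0;
}
```

A caller can then log both pieces of information, for example "lock failed
with EINVAL", rather than "failed with error code -1".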

-- 
Jeff Trawick | [EMAIL PROTECTED] | PGP public key at web site:
   http://www.geocities.com/SiliconValley/Park/9289/
 Born in Roswell... married an alien...



Re: has anybody seen worker segfaults?

2002-02-19 Thread Adam Sussman

On Tue, Feb 19, 2002 at 12:33:58PM -0500, Jeff Trawick wrote:
 Aaron Bannert [EMAIL PROTECTED] writes:
 
   Maybe this is a hint...  For a couple of the restart iterations,
   worker on AIX logs this:
   
   [crit] ap_queue_push failed with error code -1
  
  This will only happen in ap_queue_push when apr_thread_mutex_lock or
  apr_thread_mutex_unlock fails (Yes, I do error checking on the
  pthread lock/unlock cases *grin*).
  
  I'm guessing this is a problem with pthread mutexes on whatever version
  of linux runs on RH6.2?
 
 That log message was seen on AIX, not RH.
 

This would not be the first time pthread problems have been seen under
linux though.  On RH 7.1 we get segfaults in the middle of the fork
call with the prefork mpm under high load.  We still haven't been able
to figure out why this is happening, but it appears to be a problem
with linux pthreads.

Has anyone else been having problems with pthreads under linux?

-adam

-- 

I believe in Kadath in the cold waste, and Ultima Thule. But you
 cannot prove to me that Harvard Law School actually exists.
- Theodora Goss

I'm not like that, I have a cat, I don't need you.. My cat, and
 about 18 lines of bourne shell code replace you in life.
- anonymous


Adam Sussman
Vidya Media Ventures

[EMAIL PROTECTED]




Re: has anybody seen worker segfaults?

2002-02-19 Thread Brian Pane

Jeff Trawick wrote:

Maybe this is a hint...  For a couple of the restart iterations,
worker on AIX logs this:

[crit] ap_queue_push failed with error code -1


In your AIX test environment, can you catch this error
case in action by putting breakpoints at the two lines
in ap_queue_push() where it's about to return -1?

int ap_queue_push(fd_queue_t *queue, apr_socket_t *sd, apr_pool_t *p,
                  apr_pool_t **recycled_pool)
{
    /*...*/
    if (apr_thread_mutex_lock(queue->one_big_mutex) != APR_SUCCESS) {
        return FD_QUEUE_FAILURE;
    }
    /*...*/
    if (apr_thread_mutex_unlock(queue->one_big_mutex) != APR_SUCCESS) {
        return FD_QUEUE_FAILURE;
    }

    return FD_QUEUE_SUCCESS;
}

That might help isolate the source of the problem.  My two
guesses right now are:
   - pool lifetime problem, or
   - pthread library problem

--Brian