I went to play with one of my demon-children (300k seconds on her now) and looked some at the code.
(gdb) print (char *) timeout_name $7 = 0x5c480 "proxy send body" So, at least I know where it thinks the last timeout was set. Will take me a while to grok proxy_send_fb, but that seems to be where the problem is. Looking through the logs, I see at about the same time that this child was last seen fetching stuff from www.cdrom.com: [Fri Apr 11 14:59:16 1997] proxy send body timed out for 192.168.1.32 [Fri Apr 11 15:01:50 1997] proxy send body timed out for 192.168.1.32 [Fri Apr 11 15:02:10 1997] proxy send body timed out for 192.168.1.32 It looks like the timeout was received but send_body never saw it. I won't profess to be an expert on Apache's timeouts, but it looks like there should be either the timeout_req flag set here (it's not). It looks like current_conn->aborted should be set, though gdb won't let me check. All the while's in proxy_send_fb do check that... but there is a while loop in saferead that doesn't and I don't see an easy way to make it get to the connection structure. Would changing that to a hard_timeout fix it? > You may want to double check any soft_timeout()s. If any of them go off, > there will be no further timeouts around and the code is expected to > properly check connection->aborted before doing anything. A quick glance > doesn't show any such obvious things, but it is possible. Temporarily > replacing all soft_timeout()s with hard_timeout()s would tell if this is > the problem, although it could break some things; haven't really looked at > the code... > > On Sun, 13 Apr 1997, Brian Moore wrote: > > > Nope, it's a http transfer (but a large one). Not sure what it is, > > though: it seems to be that alarm(0) is getting called [which in my > > looking at the code is a bad thing to do] somewhere. The client request > > on this is on the other side of a packet filtering router, but at 10mbps, > > so it shouldn't be a client timeout. > > > > Since it got through the flush-the-buffer stuff in saferead(), I think > > it's not the speed, just the dropped alarm. See I printed out the value > > of alarms_blocked, which in theory should mean it's not blocked. :) I've > > left a couple of these children running (there were three transferring > > files via http from www.cdrom.com. The specific URLs (though I doubt it > > matters) GET http://www.cdrom.com/pub/quake/quakec/weapons/mini20.zip > > and > > GET http://www.cdrom.com/pub/quake/quakec/weapons/pnc1_02b.zip > > I killed the third demon-child -- those two are still running. > > > > Since we have about 50 machines on the far side of this router using > > illegal IP's, it may be hard to spot: they do a good amount of web/ftp > > access and it all runs through Apache, so it's a rare occurence. > > > > When tracking down the missing unblocks, I did insert some code to > > whine... something like: > > > > alarm_save = alarm(0); > > if (!alarm_save ) > > syslog(LOG_DAEMON | LOG_EMERG, "saferead, no alarm! %p", > > getpid()); > > else > > alarm(alarm_save); > > > > When that code was in place in 1.2b7 I did see a bunch of times saferead > > was getting called with no alarm, which shouldn't happen (though I will > > confess I don't know what alarm() returns if say 1/2 a second is > > remaining on Solaris). > > > > I'll see if I can find who was downloading quake this week and if they > > did anything like abort it or anything. > > > > On Sun, 13 Apr 1997, Chuck Murcko wrote: > > > > > Thanks for the report, Brian. It looks like a large file transfer is > > > indeed punching through a soft timeout. I assume these are FTP > > > transfers? I can duplicate your environment, so I should see the > > > problem when I test for it. > > > > > > Brian Moore wrote: > > > > > > > > >Number: 374 > > > > >Category: mod_proxy > > > > >Synopsis: mod_proxy(?) seems to alarm(0) somewhere > > > > >Confidential: no > > > > >Severity: serious > > > > >Priority: medium > > > > >Responsible: apache (Apache HTTP Project) > > > > >State: open > > > > >Class: sw-bug > > > > >Submitter-Id: apache > > > > >Arrival-Date: Sun Apr 13 12:00:01 1997 > > > > >Originator: [EMAIL PROTECTED] > > > > >Organization: > > > > apache > > > > >Release: 1.2b8 > > > > >Environment: > > > > Solaris 2.5, all recommended patches, gcc 2.7.2 > > > > >Description: > > > > Looks like there's one other problem in mod_proxy with alarms being > > > > turned off (not blocked via the block_alarms() call, but alarm(0)'d > > > > for some reason). I'm guessing on the module involved, since the > > > > three dead children this morning were all doing proxy stuff. > > > > > > > > The backtrace of a child that's been waiting for 110k seconds: > > > > #0 0xef67792c in _read () > > > > #1 0x29364 in saferead () > > > > #2 0x29480 in bread () > > > > #3 0x488b0 in proxy_send_fb () > > > > #4 0x47e78 in proxy_http_handler () > > > > #5 0x432c0 in proxy_handler () > > > > #6 0x1f040 in invoke_handler () > > > > #7 0x21dc0 in process_request_internal () > > > > #8 0x21df4 in process_request () > > > > #9 0x1bf30 in child_main () > > > > #10 0x1c0cc in make_child () > > > > #11 0x1c8c8 in standalone_main () > > > > #12 0x1cb88 in main () > > > > (gdb) up > > > > #1 0x29364 in saferead () > > > > (gdb) print alarms_blocked > > > > $1 = 0 > > > > > > > > So this seems to be something calling alarm(0) somewhere instead of a > > > > 'logical' alarms-off via the official mechanism. > > > > > > > > >How-To-Repeat: > > > > Not sure: virtually all of our proxy users are on a 10Mbps ethernet > > > > but behind a firewall. This usage may or may not be relevant. The > > > > children I found dead this morning were fetching files from cdrom.com > > > > via http, so it should be normal the only odd thing is that these were > > > > quake files so they were no doubt huge. >Fix: > > > > Will be looking at the code myself this week > > > > >Audit-Trail: > > > > >Unformatted: > > > > > > -- > > > chuck > > > Chuck Murcko > > > The Topsail Group, West Chester PA USA > > > [EMAIL PROTECTED] > > > > > > > > >
