Indeed, workers were crashing and getting restarted. What helped me track down which requests were crashing, and how, amidst the other junk echoed to my console was:
1. export STARMAN_DEBUG=1 2. Stop echoing all bytes sent to the console when STARMAN_DEBUG==1 3. Add a debugging echo upon each read in _prepare_env(), not just _read_headers() 4. Increase read_timeout in Starman/Server.pm to eliminate one source of failures 5. Change debugging output of Starman/Server.pm to echo the read information to different files, one named after each process. Then I discovered that one of the worker processes was reading a request in its entirety, but never getting to the point where "Closing connection" would be echoed as would typically happen. Debugging my application, I confirmed a condition that actually sent the worker process a kill signal. My small and ugly debugging patch to Starman/Server.pm is attached. Thanks for all the suggestions! - Maxwell On Thursday, January 5, 2017 at 10:35:03 PM UTC-5, Eric Wong wrote: > > maxwel...@gmail.com <javascript:> wrote: > > That's a command I didn't know about - thanks! > > > > My traffic is bursty, rather than steadily high. > > Whenever I manage to execute "ss -tl", on the load balancer or > > the backend, I see all zeros in the Recv-Q column, even if I've > > occupied the Starman workers. > > Yeah, if the worker has actually made an accept/accept4 syscall, > that'll move the connection out of the backlog so it won't show > up in Recv-Q. > > > To be sure, I'm not certain that I've occupied exactly five workers > > at the instants I get the 502 errors. Serving the application requests > > involves > > some rather heavy processing, and some other system limit, such as > memory, > > could be coming into play. However, I've never had problems hitting the > > system process limit or having the workers that are running fail due to > > memory problems. So it seems likely to me that Starman closes the > > connection "voluntarily," based on its own decision that it can't serve > the > > incoming request. > > I doubt it, starman is one-connection-per-process, so it > won't make any accept/accept4 calls while it's doing application > processing. You can `strace -p $PID_OF_WORKER -e accept,accept4` > to be sure. > > You should also also try using curl to hit the starman instances > directly and bypass nginx. > > > In my ngin error.log on the load balancer, I see the message: > > > > "*6528 upstream prematurely closed connection while reading response > header > > from upstream" > > If all else fails, perhaps try a simple PSGI app that just calls > "sleep($A_FEW_SECONDS)" before returning to see if you can > simulate it. It may also be a network or firewall problem > between nginx and starman since they're on different boxes. > > > Are there other conditions that could cause a connection to be closed > > prematurely? > > Maybe workers are crashing/exiting and getting restarted? > > > Also, where in the source code should I look for this? > > Starman itself appears to be pure Perl, with no reference to the > "backlog" > > parameter that it parses. Which underlying module is doing the heavy > > lifting, and where does it make a decision about returning a 502 > > or closing a connection? > > Probably somewhere in the Net::Server::* suite which starman > inherits from. I'm not familiar with the code myself, but > the "backlog" should be the parameter passed to the listen(2) > system call which you can verify via strace. > -- --- You received this message because you are subscribed to the Google Groups "psgi-plack" group. To unsubscribe from this group and stop receiving emails from it, send an email to psgi-plack+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Index: Starman/Server.pm =================================================================== --- Starman/Server.pm (revision 5239) +++ Starman/Server.pm (revision 5241) @@ -64,8 +64,9 @@ $options->{keepalive_timeout} = 1; } if (! exists $options->{read_timeout}) { - $options->{read_timeout} = 5; + $options->{read_timeout} = 555; } + $options->{read_timeout} = 555; if (! exists $options->{proctitle}) { $options->{proctitle} = 1; } @@ -329,6 +330,10 @@ } } + open my $out_h, '>>', "/mnt/tmp/starman.$$.info"; + print $out_h "[$$] Closing connection\n"; + close $out_h; + DEBUG && warn "[$$] Closing connection\n"; } @@ -335,6 +340,9 @@ sub _read_headers { my $self = shift; + open my $out_h, '>>', "/mnt/tmp/starman.$$.info"; + print $out_h "[$$] Reading headers of request\n"; + eval { local $SIG{ALRM} = sub { die "Timed out\n"; }; @@ -350,12 +358,16 @@ my $read = sysread $self->{server}->{client}, my $buf, CHUNKSIZE; if ( !defined $read || $read == 0 ) { - die "Read error: $!\n"; + print $out_h "[$$] Read error 1 in read_headers: $!\n"; + close $out_h; + die "Read error 1: $!\n"; } - if ( DEBUG ) { + if ( DEBUG > 1) { warn "[$$] Read $read bytes: " . dump($buf) . "\n"; - } + } elsif(DEBUG) { warn "[$$] Read $read bytes\n" } + print $out_h "$buf\n"; + print $out_h "[$$] Read $read bytes\n"; $self->{client}->{inputbuf} .= $buf; } @@ -362,18 +374,23 @@ }; alarm(0); - - if ( $@ ) { - if ( $@ =~ /Timed out/ ) { + + if( $@ ) { + if( $@ =~ m/Timed out/ ) { DEBUG && warn "[$$] Client connection timed out\n"; + print $out_h "[$$] Client connection timed out\n"; + close $out_h; return; } if ( $@ =~ /Read error/ ) { DEBUG && warn "[$$] Read error: $!\n"; + print $out_h "[$$] Read error\n"; + close $out_h; return; } } + close $out_h; # Pull out the complete header into a new buffer $self->{client}->{headerbuf} = $self->{client}->{inputbuf}; @@ -403,6 +420,8 @@ sub _prepare_env { my($self, $env) = @_; + open my $out_h, '>>', "/mnt/tmp/starman.$$.info"; + my $get_chunk = sub { if ($self->{client}->{inputbuf}) { my $chunk = delete $self->{client}->{inputbuf}; @@ -409,6 +428,8 @@ return ($chunk, length $chunk); } my $read = sysread $self->{server}->{client}, my($chunk), CHUNKSIZE; + print $out_h "$chunk\n"; + print $out_h "[$$] Read $read bytes in prepare_env\n"; return ($chunk, $read); }; @@ -420,7 +441,9 @@ my($chunk, $read) = $get_chunk->(); if ( !defined $read || $read == 0 ) { - die "Read error: $!\n"; + print $out_h "[$$] Read error 2 in prepare_env\n"; + close $out_h; + die "Read error 2: $!\n"; } $cl -= $read; @@ -462,6 +485,7 @@ } else { $env->{'psgi.input'} = $null_io; } + close $out_h; } sub _finalize_response {