Re: only 503s after a few hours (trunk-4384)

2009-12-10 Thread Sten Spans
On Wed, 9 Dec 2009, Fernando Papa wrote:

 Varnish account does not have limits (ulimit returns unlimited).

What does ulimit -n return?
Most newer linux distro's set this to 1024 by default,
which might not be enough for a busy varnish.

-- 
Sten Spans

There is a crack in everything, that's how the light gets in.
Leonard Cohen - Anthem
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: only 503s after a few hours (trunk-4384)

2009-12-10 Thread Fernando Papa
Hello,

5 Debug- Too many open files when accept(2)ing. Sleeping.
EOF

 That's your trouble:  The kernel does not allow varnish to open
 as many filedescriptors as it needs.

 Use whatever means your kernel has to change this number, possibly
 ulimit(1) or similar.


That certainly improves uptime.
I believe it is leaking descriptors.  It can't possible use so many with the 
kind of traffic (~7 req/s) it's receiving.

I restarted it yesterday with 128K descriptors.

# ps -AF | grep varn
root  4726 1  0 21419   928   2 Dec09 ?00:00:00 varnishd -a 
*:80 -f frontdoor.vcl -p obj_workspace 4096 -s malloc,100M -p thread_pools 4
nobody4727  4726  0 140497 153232 3 Dec09 ?00:06:48 varnishd -a 
*:80 -f frontdoor.vcl -p obj_workspace 4096 -s malloc,100M -p thread_pools 4

# ls -l /proc/4727/fd | wc -l
1253

It's currently using 1253 descriptors.

# ls -lat /proc/4727/fd | grep socket | head -n 1
lrwx-- 1 nobody nobody 64 Dec 10 11:40 1244 - socket:[17728046]

# ls -lat /proc/4727/fd | grep socket | tail -n 5
lrwx-- 1 nobody nobody 64 Dec  9 15:27 23 - socket:[16402456]
lrwx-- 1 nobody nobody 64 Dec  9 15:27 24 - socket:[16261656]
lrwx-- 1 nobody nobody 64 Dec  9 15:27 25 - socket:[16411051]
lrwx-- 1 nobody nobody 64 Dec  9 15:27 4 - socket:[16216281]
lrwx-- 1 nobody nobody 64 Dec  9 15:27 5 - socket:[16216282]

# ls -lat /proc/4727/fd | grep 15: | wc -l
76

It has 76 sockets that are 20 hours old.  Listening and monitoring sockets 
might explain a few of these.  [[ NOTE: Re-checked after an hour and it 
still has 76 ]]

# lsof -l -R -V -Tqs | grep 16216282
varnishd 4727 4726 99 5u IPv4 16216282 TCP *:http (LISTEN QR=0 QS=0)

OK.  16216281 also listens, but is IPv6.

# lsof -l -R -V -Tqs | grep 16411051
varnishd   4727  4726   99   25u sock0,5 
16411051 can't identify protocol

# grep 16411051 /proc/net/*
EOF

# netstat --verbose --numeric --extend --all --notrim | grep 16411051
EOF

___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: only 503s after a few hours (trunk-4384)

2009-12-09 Thread Fernando Papa
Happened again.  This time I left varnishlog running, and got...

  989 TxStatus c 503
  989 TxResponse   c Service Unavailable
  989 TxHeader c Server: Varnish
  989 TxHeader c Date: Wed, 09 Dec 2009 00:38:07 GMT
  989 TxHeader c Age: 0
  989 TxHeader c Connection: close
  989 ReqEnd   c 1933950257 1260319087.004358053 1260319087.004553556 
0.67711 0.000121117 0.74387
  989 SessionClose c error
  989 StatSess c 189.165.32.95 1480 0 1 1 0 0 0 117 0
5 Debug- Too many open files when accept(2)ing. Sleeping.
EOF

That Too many... debug message appears 709 times.  First one:

line number 14292239
 1023 TxProtocol   c HTTP/1.1
 1023 TxStatus c 200
 1023 TxResponse   c OK
 1023 TxHeader c Server: xxx
 1023 TxHeader c Cache-Control: public, max-age=3600
 1023 TxHeader c Content-Type: application/x-shockwave-flash
 1023 TxHeader c Content-Length: 659
 1023 TxHeader c Date: Wed, 09 Dec 2009 00:10:05 GMT
 1023 TxHeader c Age: 972
 1023 TxHeader c Connection: keep-alive
 1023 ReqEnd   c 1933931697 1260317405.392798424 1260317405.392909527 
0.62466 0.48637 0.62466
  706 Debugc herding
  695 Debugc herding
 1007 Debugc herding
 1023 Debugc herding
5 Debug- Too many open files when accept(2)ing. Sleeping.
  937 ReqStart c 190.13.3.173 51259 1933931698
  937 RxRequestc GET
  937 RxURLc xxx
  937 RxProtocol   c HTTP/1.1
  937 RxHeader c Accept: */*
  937 RxHeader c Referer: xxx
  937 RxHeader c Accept-Language: es-co
  937 RxHeader c UA-CPU: x86
  937 RxHeader c Accept-Encoding: gzip, deflate
  937 RxHeader c User-Agent: xxx
  937 RxHeader c Host: xxx
  937 RxHeader c Connection: Keep-Alive
  937 RxHeader c Cookie: xxx=xxx
  937 VCL_call c recv
  937 VCL_return   c lookup
  937 VCL_call c hash
  937 VCL_return   c hash
  937 VCL_call c miss
  937 VCL_return   c fetch
  937 FetchError   c no backend connection
  937 VCL_call c error
  937 VCL_return   c deliver

Varnish account does not have limits (ulimit returns unlimited).  Memory
usage (top): 403m/89m/81m.

# ps -AF | grep varnish
root 19223 1  0 21419   928   2 Dec08 ?00:00:00 varnishd -a 
*:80 -f xxx.vcl -p obj_workspace 4096 -s malloc,100M -p thread_pools 4
nobody   19224 19223  0 103378 91244  3 Dec08 ?00:06:52 varnishd -a 
*:80 -f xxx.vcl -p obj_workspace 4096 -s malloc,100M -p thread_pools 4

# kill 19224

# ps -AF | grep varnish
root 19223 1  0 21419   960   2 Dec08 ?00:00:00 varnishd -a 
*:80 -f xxx.vcl -p obj_workspace 4096 -s malloc,100M -p thread_pools 4
nobody   31319 19223  1 94673  3532   3 12:33 ?00:00:00 varnishd -a 
*:80 -f xxx.vcl -p obj_workspace 4096 -s malloc,100M -p thread_pools 4

That made it work again --as in processing requests--, except for varnishlog 
that continues to show nothing.

- Original Message - 
From: Fernando Papa fernando.p...@liveinteractive.net
To: varnish-misc@projects.linpro.no
Sent: Tuesday, December 08, 2009 2:01 PM
Subject: only 503s after a few hours (trunk-4384)


 Hello,

 After a few hours, Varnish trunk-4384 started replying with 503.
 New work-load, first day of use.  Previously used Varnish only with 
 internal
 traffic (eg no crazy headers, just carefully crafted requests).

 Facts:

 * Xeon 2.4 GHz (quad-core, 512KBx4 cache) server.  CentOS 5.4.
 * Worked fine for ~9 hours, sending to backend 150+ K requests plus 
 Varnish
 pings.
 * Backend is an application server.  Payload is 1 KB plus minimal 
 headers.
 * I can still access backend bypassing Varnish.
 * Varnish continues pinging backend, and receiving a timely 200 OK every
 time (backend logs).
 * req.backend.healthy = true.  Tested with a codepath along the lines
 (req.request == GET)  (req.url == /)  (req.backend.healthy), that
 does an 'error 200 OK'.
 * varnishlog does not output anything at all.  Not even the backend pings.
 * Tested every path within vcl_recv.  Exiting with error works, but
 lookup results in a 503.
 * Very low CPU usage.  Sorry, didn't check memory usage before restarting
 it.

 CFLAGS='-O2 -fomit-frame-pointer -march=pentium4 -mfpmath=sse'
 ./configure --prefix=$HOME/local --disable-largefile

 $HOME/local/sbin/varnishd -a *:80 -f $HOME/local/etc/varnish/xxx.vcl -p
 obj_workspace=4096 -s malloc,100M -p thread_pools=`grep -c 'model name'
 /proc/cpuinfo`

 backend fd1
 {
   .host=localhost;
   .port=85;
   .connect_timeout = 1s;
   .first_byte_timeout = 20s;
   .between_bytes_timeout =2s;
   .probe = {
  .request =
 GET /xxx HTTP/1.1
 Host: yyy
 Connection: close;
  .timeout = 0.5 s;
  .window = 8;
  .threshold = 3;
   }
 }

 director fdcluster round-robin
 {
   { .backend = fd1; }
 }

 Restarted Varnish, now it works again.

 Any idea?  Anything wrong in the command line?

 Thanks,
 Fernando

Re: only 503s after a few hours (trunk-4384)

2009-12-09 Thread Poul-Henning Kamp
In message 002b01ca78d6$a7de2b10$7802a...@interactive.live.net, Fernando Pap
a writes:
Happened again.  This time I left varnishlog running, and got...

5 Debug- Too many open files when accept(2)ing. Sleeping.
EOF

That's your trouble:  The kernel does not allow varnish to open
as many filedescriptors as it needs.

Use whatever means your kernel has to change this number, possibly
ulimit(1) or similar.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc