In message <282e72051002120949u56eb6914mbd55e5a355931...@mail.gmail.com>, Paul Wright writes:
>> I wonder if Solaris has some kind of "I already told you it were
>> closed once" logic...
>
>Here's a snippet from varnishlog for one of these panics (also
>attached to this email in case line wraps wreck formatting):

Interesting!

This time the EBADF comes in the original worker thread, before we
hand the file descriptor over to the waiter, eliminating that entire
ball of wax from the picture.

>  419091: /opt/sbin/varnishd'vca_return_session+0x1b1 [0x419091]
>  42679d: /opt/sbin/varnishd'cnt_wait+0x2bd [0x42679d]

I can find absolutely no trace of EBADF meaning "remote end closed"
in the Solaris docs or other docs on the web, but as far as I can
tell that is indeed what happens here.

But as a kernel programmer, I can see where this might come from:

Receiving a TCP-RST means that the socket is never going to be
useful again. Since you already have the socket/pcb locked, taking
it entirely out of its misery right away is cheap, and more
efficient than waiting for the process to notice and issue a
close(2) on it, and then having to relock the socket/pcb again, etc.

Next time you try to use the file descriptor, there is no socket
and EBADF ensues.

Reasoning that most programs notice the return value and call
strerror(3), not caring very much what the exact value of errno is,
you can get away with returning EBADF.

Varnish, however, is written by a cranky old FreeBSD kernel hacker
who has no pretensions about writing correct code the first time,
so 10% of the lines are asserts, and yes, I actually _do_ care about
the specific errnos returned.

And EBADF is not just any error code: it is the only errno which
has universally been recognized as meaning "programmer screwed up",
because you can only get it if you muck up your file descriptors.
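To make the distinction concrete, here is a minimal sketch in C. It is
not Varnish's code: the enum, classify_sock_errno() and
errno_after_use_of_closed_fd() are hypothetical names invented for
illustration, and the set of errnos treated as "peer went away" is an
assumption. The idea is that errnos explainable by the remote end are
handled quietly, while EBADF falls through to the "programmer screwed
up" bucket, which is what you get when you use a descriptor you have
already closed yourself.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical classification of errno after a failed socket call. */
enum sock_err {
	SOCK_ERR_PEER_GONE,	/* remote end closed or reset: clean up quietly */
	SOCK_ERR_RETRY,		/* transient: try the call again */
	SOCK_ERR_BUG		/* anything else, EBADF included: our own fault */
};

static enum sock_err
classify_sock_errno(int e)
{
	switch (e) {
	case ECONNRESET:
	case EPIPE:
	case ENOTCONN:
	case ETIMEDOUT:
		return (SOCK_ERR_PEER_GONE);
	case EINTR:
	case EAGAIN:
		return (SOCK_ERR_RETRY);
	default:
		/* A strict caller would assert here instead of carrying on. */
		return (SOCK_ERR_BUG);
	}
}

/*
 * Provoke the documented meaning of EBADF: read from a descriptor
 * that this process has already closed.  Returns the resulting errno,
 * 0 if the read unexpectedly did not fail, or -1 if pipe(2) failed.
 */
static int
errno_after_use_of_closed_fd(void)
{
	int fds[2];
	char c;
	ssize_t n;

	if (pipe(fds) != 0)
		return (-1);
	(void)close(fds[1]);
	(void)close(fds[0]);
	errno = 0;
	n = read(fds[0], &c, 1);	/* fds[0] is no longer open */
	return (n == -1 ? errno : 0);
}
```

A caller that asserts on SOCK_ERR_BUG will panic if the kernel hands
back EBADF for a descriptor the program never closed, which is the
situation described in this thread.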
Or, as one of the first hits Google gave me when researching this
describes it, more politely but no less firmly:

	Bad file number (EBADF): The file descriptor references a
	file that is either not open or is open for a conflicting
	purpose. (eg, a read(2) is specified against a file that
	is open for write(2) or vice-versa.) This is a programming
	bug.

	(http://www.princeton.edu/~unix/Solaris/troubleshoot/error.html)

If I had implemented the hack I suspect Solaris contains, I would
have found some bit somewhere to make sure the errno would be the
correct, documented and expected:

	#define ECONNRESET      54      /* Connection reset by peer */

Somebody with a Solaris service contract, if such things still
exist, should report this as a bug to them...

I will add a workaround to Varnish, with suitably sarcastic
commentary...

Poul-Henning

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
_______________________________________________
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc