Re: Child panics on OpenSolaris

Poul-Henning Kamp Fri, 12 Feb 2010 10:26:27 -0800

In message <282e72051002120949u56eb6914mbd55e5a355931...@mail.gmail.com>, Paul 
Wright writes:


>> I wonder if Solaris has some kind of "I already told you it were
>> closed once" logic...
>
>Here's a snippet from varnishlog for one of these panics (also
>attached to this email in case line wraps wreck formatting):

Interesting!

This time the EBADF comes in the original worker thread, before we
hand the file descriptor over to the waiter, eliminating that entire
ball of wax from the picture.

>  419091: /opt/sbin/varnishd'vca_return_session+0x1b1 [0x419091]
>  42679d: /opt/sbin/varnishd'cnt_wait+0x2bd [0x42679d]

I can find absolutely no trace of EBADF meaning "remote end closed"
in the Solaris docs or other docs on the web, but as far as I can
tell that is indeed what happens here.

But as a kernel programmer, I can see where this might come from:

Receiving a TCP-RST means that the socket is never going to be
useful again.  Since you already have the socket/pcb locked, taking
it entirely out of its misery right away is cheap, and more efficient
than waiting for the process to notice and issue a close(2) on it,
and then having to relock the socket/pcb again etc. etc.

Next time you try to use the file descriptor, there is no socket
and EBADF ensues.

Reasoning that most programs only notice the return value and call
strerror(3), not caring very much what the exact value of errno
is, you can get away with returning EBADF.

Varnish, however, is written by a cranky old FreeBSD kernel hacker,
who has no pretensions about writing correct code the first time,
so 10% of the lines are asserts and yes I actually _do_ care about
the specific errnos returned.

And EBADF is not just any error code, it is the only errno which
has universally been recognized as meaning "programmer screwed up",
because you can only get it if you muck up your file descriptors.
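
For illustration, the classic way to earn an EBADF is to use a
descriptor after closing it.  A minimal sketch, assuming a POSIX
system; the helper name is made up for this example:

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: close the read end of a fresh pipe, then try
 * to read from the now-closed descriptor.  Returns the errno that
 * read(2) reports, 0 if the read unexpectedly succeeded, or -1 if
 * the pipe could not be created.
 */
int
errno_after_use_of_closed_fd(void)
{
	int fds[2];
	char c;
	int e = 0;

	if (pipe(fds) != 0)
		return (-1);

	close(fds[0]);		/* fds[0] no longer names an open file */

	errno = 0;
	if (read(fds[0], &c, 1) == -1)
		e = errno;	/* the programmer screwed up */

	close(fds[1]);
	return (e);
}
```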

Or, as one of the first hits Google gave me when researching this
describes it, more politely but no less firmly:

        Bad file number (EBADF): The file descriptor references a
        file that is either not open or is open for a conflicting
        purpose. (eg, a read(2) is specified against a file that
        is open for write(2) or vice-versa.) This is a programming
        bug.

        (http://www.princeton.edu/~unix/Solaris/troubleshoot/error.html)

If I had implemented the hack I suspect Solaris contains, I would
have found some bit somewhere to make sure the errno would be the
correct, documented and expected one:

        #define ECONNRESET      54     /* Connection reset by peer */

Somebody with a Solaris service contract, if such things still
exist, should report this as a bug to them...

I will add a workaround to Varnish, with a suitable sarcastic
commentary...
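
A minimal sketch of what such a workaround could look like.  This is
not the actual Varnish change; the enum, the function and the
ebadf_means_reset flag are all hypothetical names, and the flag would
only be set on platforms suspected of the behaviour described above:

```c
#include <errno.h>

/* Hypothetical verdicts for an errno seen on a client socket. */
enum sock_verdict {
	SOCK_OK,		/* no error */
	SOCK_PEER_GONE,		/* remote end went away, just clean up */
	SOCK_BUG		/* unexpected, worth an assert */
};

/*
 * Classify an errno from a socket operation.  When ebadf_means_reset
 * is non-zero, EBADF is treated like ECONNRESET instead of being
 * taken as proof that we mucked up our file descriptors.
 */
enum sock_verdict
classify_sock_errno(int e, int ebadf_means_reset)
{
	switch (e) {
	case 0:
		return (SOCK_OK);
	case ECONNRESET:
	case EPIPE:
		return (SOCK_PEER_GONE);
	case EBADF:
		/* Assumption: the kernel tore the socket down on RST. */
		return (ebadf_means_reset ? SOCK_PEER_GONE : SOCK_BUG);
	default:
		return (SOCK_BUG);
	}
}
```

Keeping EBADF fatal everywhere else preserves its value as a
"programmer screwed up" signal on platforms that return the
documented errno.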

Poul-Henning

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.