But all too often necessary :)

On Tue, Mar 3, 2015 at 12:14 AM Ramkumar R. Aiyengar <
[email protected]> wrote:

> I agree, SIGKILL is typically the last resort.
> On 3 Mar 2015 00:49, "Reitzel, Charles" <[email protected]>
> wrote:
>
>>  My bad.  Too long away from sockets since cleaning up those shutdown
>> handlers.  Your point is well taken, on the server side the risks of
>> consuming a stray echo packet are fairly low (but non-zero, if you’ve ever
>> spent any quality time with tcpdump/wireshark).
>>
>>
>>
>> Still, in a production setting, SIGKILL (aka “kill -9”) should be a last
>> resort after more reasonable methods (e.g. SIGINT, SIGTERM, SIGSTOP) have
>> failed.
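
A minimal sketch of that escalation, in Python rather than anything Solr ships: send SIGTERM first, wait a grace period, and fall back to SIGKILL only if the process is still alive. The helper name stop_process and the 5-second grace period are assumptions for illustration.

```python
import subprocess


def stop_process(proc, grace_seconds=5.0):
    """Stop a child process politely, escalating only if that fails.

    `proc` is a subprocess.Popen object. Returns "terminated" if the
    process exited within the grace period, or "killed" if it had to be
    force-killed. (Hypothetical helper, not part of Solr or Jetty.)
    """
    proc.terminate()  # SIGTERM on POSIX: the process may clean up and exit
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL on POSIX: cannot be caught, no cleanup runs
        proc.wait()   # reap the child so it does not linger as a zombie
        return "killed"
```

A service wrapper could call this with a grace period long enough for open connections to be closed cleanly.
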
>>
>>
>>
>> *From:* Ramkumar R. Aiyengar [mailto:[email protected]]
>> *Sent:* Monday, March 02, 2015 7:00 PM
>> *To:* [email protected]
>> *Subject:* RE: reuseAddress default in Solr jetty.xml
>>
>>
>>
>> No, reuseAddress doesn't allow you to have two processes, old and new,
>> listen to the same port. There's no option which allows you to do that.
>>
>> TL;DR: This can happen when you have a connection to a server which gets
>> killed hard and comes back up immediately.
>>
>> So here's what happens.
>>
>> When a server shuts down normally, it triggers an active close on all
>> open TCP connections it has. That starts a three-way message exchange with
>> the remote recipient (FIN, FIN+ACK, ACK), at the end of which the socket is
>> closed and the kernel puts it in a TIME_WAIT state for a few minutes in the
>> background (depends on the OS; the maximum tends to be 4 minutes). This is
>> needed to allow for reordered older packets to reach the machine just in
>> case. Now, typically, if the server restarts within that period and tries
>> to bind again to the same port, the kernel is smart enough not to complain
>> that there is an existing socket in TIME_WAIT, because it knows the last
>> sequence number it used for the final message in the previous process, and
>> since sequence numbers are always increasing, it can reject any messages
>> before that sequence number as a new process has now taken the port.
>>
>> Trouble is with abnormal shutdown. There's no time for a proper goodbye,
>> so the kernel marks the socket to respond to remote packets with a rude RST
>> (reset). Since there has been no goodbye with the remote end, it also
>> doesn't know the last sequence number to delineate if a new process binds
>> to the same port. Hence by default it denies binding to the same port for
>> the TIME_WAIT period to avoid the off chance a stray packet gets picked up
>> by the new process and utterly confuses it. By setting reuseAddress, you
>> are essentially waiving this protection. Note that this possibility of
>> confusion is unbelievably minuscule in the first place (both the source and
>> destination host:port should be the same and the client port is generally
>> randomly allocated). If the port we are talking of is a local port, it's
>> almost impossible -- you have bigger problems if a TCP packet is lost or
>> delayed within the same machine!
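
For reference, a minimal sketch of what setting reuseAddress amounts to at the socket level, written in Python for brevity. Java's ServerSocket.setReuseAddress(true) sets the same SO_REUSEADDR option. The helper name open_listener and the loopback/ephemeral-port defaults are assumptions for illustration, not Solr or Jetty code.

```python
import socket


def open_listener(host="127.0.0.1", port=0, reuse_address=True):
    """Open a listening TCP socket, optionally with SO_REUSEADDR set.

    port=0 asks the OS for any free port. (Hypothetical helper.)
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if reuse_address:
        # Must be set before bind(); setting it afterwards has no effect
        # on whether this bind succeeds.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(5)
    return sock
```

Calling open_listener(port=8983) right after a hard kill is the situation discussed here; with reuse_address=False the bind may be refused while old connections on that port sit in TIME_WAIT.
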
>>
>> As to Shawn's point, for Solr's stop port, you essentially need to be
>> trying to actively shut down the server using the stop port, or be within a
>> few minutes of such an attempt when the server is killed. Just the server
>> being killed without any active connection to it is not going to cause this
>> issue.
>>
>> Hi Ram,
>>
>>
>>
>> It appears the problem is that the old solr/jetty process is actually
>> still running when the new solr/jetty process is started.   That’s the
>> problem that needs fixing.
>>
>>
>>
>> This is not a rare problem in systems with worker threads dedicated to
>> different tasks.   These threads need to wake up in response to the
>> shutdown signal/command, as well as the normal inputs.
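
One common way to get that wake-up behaviour is sketched below in Python: the worker blocks on a queue, and shutdown is delivered through the same queue as a sentinel, so the thread wakes for it exactly as it would for normal input. The names worker, run_and_stop and _STOP are hypothetical; this is not how Solr or Jetty structure their threads.

```python
import queue
import threading

_STOP = object()  # sentinel meaning "shut down now" (hypothetical name)


def worker(tasks, results):
    """Process items until the shutdown sentinel arrives."""
    while True:
        item = tasks.get()      # blocks until work OR the sentinel arrives
        if item is _STOP:
            break               # wake-up caused by shutdown: exit the loop
        results.append(item * 2)


def run_and_stop(items):
    """Feed items to one worker, then shut it down via the same queue."""
    tasks = queue.Queue()
    results = []
    thread = threading.Thread(target=worker, args=(tasks, results))
    thread.start()
    for item in items:
        tasks.put(item)
    tasks.put(_STOP)            # shutdown travels the normal input path
    thread.join(timeout=5)
    return results, thread.is_alive()
```

Because the sentinel is queued behind pending work, the worker drains what it was given before exiting, and the process can then release its ports on its own.
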
>>
>>
>>
>> It’s a bug I’ve created and fixed a couple times over the years … :-)
>> I wouldn’t know where to start with Solr.  But, as I say, re-using the port
>> is a band-aid.  I’ve yet to see a case where it is the best solution.
>>
>>
>>
>> best,
>>
>> Charlie
>>
>>
>>
>> *From:* Ramkumar R. Aiyengar [mailto:[email protected]]
>> *Sent:* Saturday, February 28, 2015 8:15 PM
>> *To:* [email protected]
>> *Subject:* Re: reuseAddress default in Solr jetty.xml
>>
>>
>>
>> Hey Charles, see my explanation above on why this is needed. If Solr has
>> to be killed, it would generally be immediately restarted. This would
>> normally not be the case, except when things are potentially misconfigured
>> or if there is a bug, but not doing so makes the impact worse.
>>
>> In any case, turns out really that reuseAddress is true by default for
>> the connectors we use, so that really isn't the issue. The issue more
>> specifically is that the stop port doesn't do it, so the actual port by
>> itself starts just fine on a restart, but the stop port fails to bind --
>> and there's no way currently in Jetty to configure that.
>>
>> Based on my question on the Jetty mailing list, I have now created an
>> issue for them.
>>
>> https://bugs.eclipse.org/bugs/show_bug.cgi?id=461133
>>
>>
>>
>> On Fri, Feb 27, 2015 at 3:03 PM, Reitzel, Charles <
>> [email protected]> wrote:
>>
>> Disclaimer: I’m not a Solr committer.  But, as a developer, I’ve never
>> seen a good case for reusing the listening port.   Better to find and fix
>> the root cause on the zombie state (or just slow shutdown, sometimes) and
>> release the port.
>>
>>
>>
>> *From:* Mark Miller [mailto:[email protected]]
>> *Sent:* Thursday, February 26, 2015 5:28 PM
>> *To:* [email protected]
>> *Subject:* Re: reuseAddress default in Solr jetty.xml
>>
>>
>>
>> +1
>>
>> - Mark
>>
>>
>>
>> On Thu, Feb 26, 2015 at 1:54 PM Ramkumar R. Aiyengar <
>> [email protected]> wrote:
>>
>> The jetty.xml we currently ship by default doesn't set reuseAddress=true.
>> If you are having a bad GC day with things going OOM and resulting in Solr
>> not even being able to shut down cleanly (or the oom_solr.sh script killing
>> it), whatever external service management mechanism you have is probably
>> going to try to respawn it and fail with the default config because the
>> ports will be in TIME_WAIT. I guess there's the usual disclaimer with
>> reuseAddress causing stray packets to reach the restarted server, but
>> it sounds like at least the default should be true.
>>
>> I can raise a JIRA, but just wanted to check if anyone has any opinions
>> either way..
>>
>>
>>
>>
>> *************************************************************************
>> This e-mail may contain confidential or privileged information.
>> If you are not the intended recipient, please notify the sender
>> immediately and then delete it.
>>
>> TIAA-CREF
>> *************************************************************************
>>
>>
>>
>>
>> --
>>
>> Not sent from my iPhone or my Blackberry or anyone else's
>>
>>
>>
>
