[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378850#comment-15378850 ] Shalin Shekhar Mangar commented on SOLR-9290: - Thanks for reviewing Mark but I already fixed that in the last patch. I found a test failure in ZkControllerTest because of a thread leak so I may post another patch soon. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > SOLR-9290.patch, SOLR-9290.patch, index.sh, setup-solr.sh, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378846#comment-15378846 ] Mark Miller commented on SOLR-9290: --- Patch looks okay to me. {noformat} +clientConnectionManager.shutdown(); +IOUtils.closeQuietly(defaultClient); {noformat} Not that it likely matters, but I'd reverse this and shut down the pool after the client. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > SOLR-9290.patch, SOLR-9290.patch, index.sh, setup-solr.sh, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377991#comment-15377991 ] Mark Miller commented on SOLR-9290: --- Okay, I didn't catch you were not removing the stale check. bq. For reasons I don't understand, 'idle' connections are more likely to (exist? | be kept around indefinitely?) when the intra-node communication is over SSL. I think I remember reading the SSL handles connection lifecycle differently, based on the SSL spec. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, index.sh, > setup-solr.sh, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377355#comment-15377355 ] Shalin Shekhar Mangar commented on SOLR-9290: - [~markrmil...@gmail.com] -- But you were trying to remove the stale check and disable Nagle's algorithm as well which exposed you to the NoHttpResponseExceptions. We aren't trying to do that here. We just want to close the idle connections so that they don't keep accumulating in the CLOSE_WAIT state. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, index.sh, > setup-solr.sh, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377345#comment-15377345 ] Mark Miller commented on SOLR-9290: --- bq. why not just re-use the IdleConnectionEvictor class provided by httpcomponents I've gone down this road. It's not a great solution. This is why we ended up changing to the new API's instead in SOLR-4509. Just having an evictor thread is not enough - you also then want the ability to check connections before use if they have been sitting in the pool too long and that requires HttpClient changes they made in the new API's. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, index.sh, > setup-solr.sh, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377292#comment-15377292 ] Shalin Shekhar Mangar commented on SOLR-9290: - Hmm, you're right Yonik. But we've always had an idle timeout for the http connector in jetty set to 50 seconds (I traced this back to SOLR-128). So after 50 seconds of inactivity, Jetty closes that connection from its end and the client's socket goes to CLOSE_WAIT state. As you said, this connection cannot be re-used anymore. When httpclient tries to use the connection, it does the stale check, sees the CLOSE_WAIT state and terminates the connection and gives a new one to Solr. So all the connections that suddenly do not show up in CLOSE_WAIT and we assumed that they went to ESTABLISHED state were actually terminated. So in summary, our assumption that connections in CLOSE_WAIT are kept around because of re-use is wrong but it still doesn't change the solution that I've proposed. We could also think of increasing the value of Jetty's idle timeout as a separate change but the idle eviction thread would still be necessary. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, index.sh, > setup-solr.sh, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376974#comment-15376974 ] Yonik Seeley commented on SOLR-9290: I haven't been following this issue, but this caught my eye: bq. Solr's use of HttpClient for intra-node communication has historically always had the potential to result in connections sitting "idle" (ie: in a CLOSE_WAIT state) for possible re-use later It's been a *long* time since I messed around with making sure Solr worked with persistent connections (we're talking CNET days... 2004,2005 ;-) But CLOSE_WAIT is when one side has closed the connection... there's no going back to ESTABLISHED from that state (i.e. no reusing that connection). > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, index.sh, > setup-solr.sh, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376893#comment-15376893 ] Shalin Shekhar Mangar commented on SOLR-9290: - Hoss has covered most of the things but just a few comments (note that I'm responding to multiple people and comments here): bq. Why not backport that and avoid the problem entirely? Is it a different client version in master or something that makes it not that easy? We could backport SOLR-4509 to 6.x and deal with the incompatible changes but I'd certainly not backport it to 5x because it is just a huge change and I am not comfortable releasing that in a minor bug-fix release. I am sure many people running 5.x releases would also like a fix to this issue. Adding an idle eviction thread is trivial and unlikely to cause any regressions. {quote} Shalin Shekhar Mangar: why not just re-use the IdleConnectionEvictor class provided by httpcomponents (getting the exact same underlying impl as what master gets from HttpClientBuilder.evictIdleConnections) ? https://hc.apache.org/httpcomponents-client-4.4.x/httpclient/apidocs/org/apache/http/impl/client/IdleConnectionEvictor.html {quote} I wasn't aware of this class. But looking deeper, I see that it requires a HttpClientConnectionManager instance but the 6.x and 5.x code uses the deprecated PoolingClientConnectionManager which extends ClientConnectionManager. But now that we know it exists, I can just borrow it from the httpclient project instead of writing my own evictor. It is ASLv2 anyway. bq. Somebody sanity check my understanding / summary description of the root issue... That sounds about right to me Hoss. Thanks for the summary! bq. For reasons I don't understand, 'idle' connections are more likely to (exist? | be kept around indefinitely?) when the intra-node communication is over SSL. Perhaps the SSL setup/teardown overhead adds some latency such that concurrent requests end up opening more connections overall? I am just guessing here. bq. Which begs the question: why are there 15 CLOSE_WAIT connections that last forever on branch_6x even with this patch? As Shai said, this is likely the HttpShardHandler's pool. The overseer collection processor invokes a core admin create for each replica in parallel so you get 15 connections for 15 replicas that were created by the collection API. I'm working on a new patch which applies on branch_6x that incorporates Shai's comments as well. We can then backport it to 5x. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, index.sh, > setup-solr.sh, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376474#comment-15376474 ] Shai Erera commented on SOLR-9290: -- bq. Which begs the question: why are there 15 CLOSE_WAIT connections that last forever on branch_6x even with this patch? I think Shalin's patch only adds this monitor thread to {{UpdateShardHandler}}, but not to {{HttpShardHandlerFactory}} so these 15 could be from it? > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, index.sh, > setup-solr.sh, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375885#comment-15375885 ] Hoss Man commented on SOLR-9290: Somebody sanity check my understanding / summary description of the root issue... * Solr's use of HttpClient for intra-node communication has historically always had the potential to result in connections sitting "idle" (ie: in a CLOSE_WAIT state) for possible re-use later -- but these connections are kept open indefinitely. ** For reasons I don't understand, 'idle' connections are more likely to (exist? | be kept around indefinitely?) when the intra-node communication is over SSL. * {{maxUpdateConnections}} and {{maxUpdateConnectionsPerHost}} have always set hard upper limits on the number of connections that could ever be created -- let alone in sitting idle in a CLOSE_WAIT state. * Prior to SOLR-8533, the default values for these limits was relatively low, making it unlikely that users could ever observe an extreme # of idle / CLOSE_WAIT threads -- you were more likely to have your Solr cluster crash from deadlocks then notice any serious OS level problem with too many idle connections * After SOLR-8533, the increased default values of these limits made the problem much more noticeable * SOLR-4509's changes included use of a new option which results in a background thread checking for an existing idle connections on the master branch * This issue address the problem for branch_6x (and older) branches via a similar background thread > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375821#comment-15375821 ] Hoss Man commented on SOLR-9290: I'm no expert but... bq. I don't understand why the preferred approach here is to just have a thread that is trying to close connections. Is the problem that these connections would never otherwise be closed? ... ...my understanding is yes: In a situation where indexing load spikes up, you can get a lot of connections which are never completely closed. (even if they are never needed anymore) bq. ...If that is the case, why can't we solve the problem of them not being closed as a part of their normal usage? ... again, IIUC: because they are pooled connections maintained by the HTTP layer. Per the docs shalin quoted, clients are required to call ClientConnectionManager#closeExpiredConnections() if they (ie: "we") want to ensure those connections get closed properly. bq. It sounds like master doesn't have this problem because of different client settings? ... Why not backport that and avoid the problem entirely? Is it a different client version in master or something that makes it not that easy? master & branch_6x (and earlier) use completely diff http client APIs (see SOLR-4509) ... the {{HttpClientBuilder.evictIdleConnections}} method shalin refered to being used on master is on a class ({{HttpClientBuilder}}) that is not used at all in branch_6x. The docs of that method describe it doing virtually the same exact same thing on the (private connection pool for the) HttpClient as what Shalin's patch does (on the pool in the shared ClientConnectionManager) ... {noformat} Makes this instance of HttpClient proactively evict idle connections from the connection pool using a background thread. {noformat} Which makes me wonder... [~shalinmangar]: why not just re-use the {{IdleConnectionEvictor}} class provided by httpcomponents (getting the exact same underlying impl as what master gets from {{HttpClientBuilder.evictIdleConnections}}) ? https://hc.apache.org/httpcomponents-client-4.4.x/httpclient/apidocs/org/apache/http/impl/client/IdleConnectionEvictor.html > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375727#comment-15375727 ] Scott Lindner commented on SOLR-9290: - I would like to add something, too. The problem must stem from some sort of OS-level setting. In our environment I've noticed that when a given IP+PORT combo reaches ~28k connections in a CLOSE_WAIT state that the OS, itself, cannot allow any more connections to that IP+PORT combo (i.e. even curl fails to that combo - but to other combos, including other ports on that same host - it works just fine). I mention this because the problem seems related here to whatever settings we configure solr to use and you really must change these things in combination for it to ultimately make sense or you risk hitting this problem at some point - though admittedly with the bg thread it wouldn't be permanent like it is for us today. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375666#comment-15375666 ] David A. Bradley commented on SOLR-9290: I don't understand why the preferred approach here is to just have a thread that is trying to close connections. Is the problem that these connections would never otherwise be closed? If that is the case, why can't we solve the problem of them not being closed as a part of their normal usage? It sounds like master doesn't have this problem because of different client settings? : "Also, I think the reason this wasn't reproducible on master is because SOLR-4509 enabled eviction of idle connections by calling HttpClientBuilder#evictIdleConnections with a 50 second limit." Why not backport that and avoid the problem entirely? Is it a different client version in master or something that makes it not that easy? "So we must periodically close such connections once they're idle to avoid the number of such connections increasing to absurd limits." It seems from the discussion here that the problem is hitting a high number of connections, which is only allowed to be so high because we asked for it. What if this thread lags behind enough that the connections get too high? It sounds like the purpose of this thread is to race to prevent Solr from doing what we asked it to do. The idea of having a thread to deal with any connections that end up in a bad state unexpectedly makes sense, but is the cause of all these CLOSE_WAIT connections really from unexpected behavior? I feel like I must be missing something. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375601#comment-15375601 ] Shai Erera commented on SOLR-9290: -- Oh I see. So we didn't experience the problem because we run w/ 2 replicas (and one shard currently) and with 5.4.1's settings the math for us results in a low number of connections. But someone running a larger Solr deployment could already hit that problem prior to 5.5. Thanks for the clarification! > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375589#comment-15375589 ] Shalin Shekhar Mangar commented on SOLR-9290: - bq. I was asking about 5.3.2 – how could CLOSE_WAITs get high in 5.3.2 when maxConnectionsPerHost was the same as in 5.4.1? 5.3.2 has maxConnectionsPerHost=100 for updates and maxConnectionsPerHost=20 for queries. So on a leader you may have 100*replicationFactor+20*numShards*replicationFactor connections. For a large cluster with many shards and replicas, the overall number of such connections can be quite high. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375582#comment-15375582 ] Shai Erera commented on SOLR-9290: -- Regarding the patch, the monitor looks good. Few comments: * I prefer that we name it {{IdleConnectionsMonitor}} (w/ 's', plural connections). It goes for the class, field and thread name. * Do you intend to keep all the log statements around? * Do you think we should make the polling interval (10s) and idle-connections-time (50s) configurable? Perhaps through system properties? > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375571#comment-15375571 ] Shai Erera commented on SOLR-9290: -- bq. Do you have only two replicas? Perhaps the maxConnectionsPerHost limit of 100 is kicking in? Yes, we do have only 2 replicas and I get why the CLOSE_WAITs stop at 100. I was asking about 5.3.2 -- how could CLOSE_WAITs get high in 5.3.2 when maxConnectionsPerHost was the same as in 5.4.1? > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375560#comment-15375560 ] Shalin Shekhar Mangar commented on SOLR-9290: - bq. I didn't see the monitor in the latest patch, only the log printouts. Did you forget to add it? Sorry [~shaie], I noticed that after uploaded. I have uploaded the right patch now. Please review. bq. (1) Can/Should we have a similar monitor for HttpShardHandlerFactory? I think so. This patch was only for my tests. bq. Any reason why the two don't share the same HttpClient instance? Hmm. I think originally the idea was to keep the pools for indexing and querying separate but now that the limit (for updates) is so high, I wonder if it still makes sense. I mean, yes you can deadlock a distributed search because of high indexing and vice-versa if you share the pool but if you ever reach the high limit of 100,000 connections, you have more serious problems in the cluster anyway. bq. I wonder why we don't see the problem with 5.4.1. I mean, we do see CLOSE_WAITs piling, but stop at ~100 (200 for the leader) Do you have only two replicas? Perhaps the maxConnectionsPerHost limit of 100 is kicking in? > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375552#comment-15375552 ] Shai Erera commented on SOLR-9290: -- bq. I thought that hypothesis holds only after SOLR-8533. Are you saying you also saw it on 5.3.2? If so, what are the values that are set for these properties there? We definitely do not see the problem with 5.4.1, but we didn't test prior versions. We posted at the same time, I read your answer above. I wonder why we don't see the problem with 5.4.1. I mean, we do see CLOSE_WAITs piling, but stop at ~100 (200 for the leader). > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375548#comment-15375548 ] Shai Erera commented on SOLR-9290: -- Thanks [~shalinmangar]. Few questions: bq. Also, I think the reason this wasn't reproducible on master is because SOLR-4509 enabled eviction of idle threads by calling HttpClientBuilder#evictIdleConnections with a 50 second limit. Is this something we can apply to 5x/6x too? bq. This patch adds a monitor thread for the pool created in UpdateShardHandler and with this applied I didn't see the monitor in the latest patch, only the log printouts. Did you forget to add it? bq. There are still a few connections in CLOSE_WAIT at steady state but I verified that they belong to a different HttpClient instance in HttpShardHandlerFactory and other places. (1) Can/Should we have a similar monitor for HttpShardHandlerFactory? (2) Any reason why the two don't share the same HttpClient instance? bq. This patch applies on 5.3.2 bq. We have a large limit for maxConnections and maxConnectionsPerHost I thought that hypothesis holds only after SOLR-8533. Are you saying you also saw it on 5.3.2? If so, what are the values that are set for these properties there? We definitely *do not* see the problem with 5.4.1, but we didn't test prior versions. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375541#comment-15375541 ] Shalin Shekhar Mangar commented on SOLR-9290: - {quote} Shalin Shekhar Mangar mentioned that he's able to reproduce this in 5.3.2 as well, which was without SOLR-8533 so we certainly need to look at this more. Shalin, can you confirm if you were running your tests in stock Solr ? {quote} Actually it is 5.3.2 with some kerberos patches but the client which originally reported the issue was using stock 5.3.2. I don't think the changes are relevant. I believe this was a problem all along. It just got amplified with SOLR-8533 in 5.5.x because now the limit is higher. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, > setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375534#comment-15375534 ] Anshum Gupta commented on SOLR-9290: [~shalinmangar] mentioned that he's able to reproduce this in 5.3.2 as well, which was without SOLR-8533 so we certainly need to look at this more. Shalin, can you confirm if you were running your tests in stock Solr ? > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375500#comment-15375500 ] Shai Erera commented on SOLR-9290: -- Thanks [~yo...@apache.org], I'll read the issue. I agree with what you write in general, but we do hit an issue with these settings. That that it reproduces easily with SSL enabled suggests that the issue may not be in Solr code at all, but I wonder if we shouldn't perhaps pick smaller default values if SSL is enabled? (Our guess at the moment is that HC keeps more connections in the pool when SSL is enabled because they are more expensive to initiate, but it's just a guess). And maybe the proper solution would be what [~shalinmangar] wrote above -- have a bg monitor which closes idle/expired connections. I actually wonder why it can't be a property of {{ClientConnectionManager}} that you can set to auto close idle/expired connections after a period of time. We can potentially have that monitor act only if SSL is enabled (or at least until non-SSL exhibits the same problems too). > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375243#comment-15375243 ] Yonik Seeley commented on SOLR-9290: bq. if you have a link to a discussion about why it may lead to a distributed deadlock, I'd be happy to read it. SOLR-683 Same logic applies to any internal general purpose thread pools or connection pools / connection limits. Think of acquiring a thread like acquiring a lock. If there are going to be a limited number of resources, then one needs to be very careful under what circumstances those resources can be acquired. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375184#comment-15375184 ] Shai Erera commented on SOLR-9290: -- Also [~markrmil...@gmail.com], for education purposes, if you have a link to a discussion about why it may lead to a distributed deadlock, I'd be happy to read it. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375151#comment-15375151 ] Shai Erera commented on SOLR-9290: -- Thanks [~markrmil...@gmail.com]. In that case, what's your take on the issue at hand? > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375117#comment-15375117 ] Mark Miller commented on SOLR-9290: --- The defaults need to be very high to avoid distributed deadlock. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375082#comment-15375082 ] Shai Erera commented on SOLR-9290: -- An update -- I've modified our solr.xml (which is basically the vanilla solr.xml) with these added props (under the {{solrcloud}} element) and I do not see the connections spike anymore: {noformat} 1 100 {noformat} Those changes were part of SOLR-8533. [~markrmil...@gmail.com] on that issue you didn't explain why the defaults need to be set that high. Was there perhaps an email thread you can link to which includes more details? I ask because one thing I've noticed is that if I query {{solr/admin/info/system}}, the {{system.openFileDescriptorCount}} is very high when there are many CLOSE_WAITs. Such a change in Solr default probably need to be accompanied by an OS-level setting too, no? I am still running tests with those props set in solr.xml, on top of 5.5.1. [~mbjorgan] would you mind testing in your environment too? [~hoss...@fucit.org], sorry I completely missed your questions. Our solr.xml is the vanilla one, we didn't modify anything in it. We did uncomment the SSL props in solr.in.sh as the ref guide says, but aside from the key name and password, we didn't change any settings. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374779#comment-15374779 ] Shai Erera commented on SOLR-9290: -- bq. Interestingly, the number of connections stuck in CLOSE_WAIT decrease during indexing and increase again about 10 or so seconds after the indexing is stopped. I've observed that too and it's not that they decrease, but rather that the connections change their state from CLOSE_WAIT to ESTABLISHED, then when indexing is done to TIME_WAIT and then finally to CLOSE_WAIT again. I believe this aligns with what the HC documentation says -- the connections are not necessarily released, but kept in the pool. When you re-index again, they are reused and go back to the pool. bq. However, this commit only increases the limits on how many update connections that can be open That's interesting and might be a temporary workaround for the problem, which I intend to test shortly. In 5.4.1 they were both modified to 100,000: {noformat} - public static final int DEFAULT_MAXUPDATECONNECTIONS = 1; - public static final int DEFAULT_MAXUPDATECONNECTIONSPERHOST = 100; + public static final int DEFAULT_MAXUPDATECONNECTIONS = 10; + public static final int DEFAULT_MAXUPDATECONNECTIONSPERHOST = 10; {noformat} This can explain why we run into trouble with 5.5.1 but not with 5.4.1. Though even in 5.4.1 there are few hundreds of CLOSE_WAIT connections, with 5.5.1 they reach (in our case) the orders of 35-40K, at which point Solr became useless, not being able to talk to the replica or pretty much anything else. I see these can be defined in solr.xml, though it's not documented how, so I'm going to give it a try and will report back here. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374478#comment-15374478 ] Mads Tomasgård Bjørgan commented on SOLR-9290: -- I performed a bisect, yielding some commit fo 5.4.1 as good, and a commit from 5.5.3 as bad. This gave the following commit: ad9b87a7285e444cd61fffb83c0aee06c8f7cef0, as the first bad commit. However, this commit only increases the limits on how many update connections that can be open. Thus - this problem affects version 5.4.1 aswell - but is harder to see as Solr isn't allowed to use that many connections when updating. I built Solr on top of the last commit from branch_5_4 (7d52c2523c7a4ff70612742b76b934a12b493331), and implemented the commit that was supposed to be bad, and ended up with the same CLOSE_WAIT-leak. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > Attachments: SOLR-9290-debug.patch, setup-solr.sh > > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373452#comment-15373452 ] Shalin Shekhar Mangar commented on SOLR-9290: - It is reproducible very easily on stock solr with SSL enabled. My test setup creates two SSL-enabled Solr instances with a 5 shard x 2 replica collection and runs a short indexing program (just 9 update requests with 1 document each and a commit at the end). Keep on running the indexing program repeatedly and the number of connections in the CLOSE_WAIT state gradually increase. Interestingly, the number of connections stuck in CLOSE_WAIT decrease during indexing and increase again about 10 or so seconds after the indexing is stopped. I can reproduce the problem on 6.1, 6.0, 5.5.1, 5.3.2. I am not able to reproduce this on master although I don't see anything relevant that has changed since 6.1 -- I tried this only once so it may have just been bad timing? When the connections show in CLOSE_WAIT state, the recv-q buffer always has exactly 70 bytes. {code} netstat -tonp | grep CLOSE_WAIT | grep java tcp 70 0 127.0.0.1:56538 127.0.1.1:8983 CLOSE_WAIT 21654/java off (0.00/0/0) tcp 70 0 127.0.0.1:47995 127.0.1.1:8984 CLOSE_WAIT 21654/java off (0.00/0/0) tcp 70 0 127.0.0.1:47477 127.0.1.1:8984 CLOSE_WAIT 21654/java off (0.00/0/0) tcp 70 0 127.0.0.1:47996 127.0.1.1:8984 CLOSE_WAIT 21654/java off (0.00/0/0) tcp 70 0 127.0.0.1:56644 127.0.1.1:8983 CLOSE_WAIT 21654/java off (0.00/0/0) tcp 70 0 127.0.0.1:56533 127.0.1.1:8983 CLOSE_WAIT 21654/java off (0.00/0/0) ... {code} If I run the same steps with SSL disabled then the connections in CLOSE_WAIT state have just 1 byte in recv-q. I don't see the number of such connections increasing with indexing over time but I know for a fact (from a client) that eventually more and more connections pile up in this state even without SSL. {code} tcp 1 0 127.0.0.1:41723 127.0.1.1:8983 CLOSE_WAIT 2522/javaoff (0.00/0/0) tcp 1 0 127.0.0.1:41780 127.0.1.1:8983 CLOSE_WAIT 2640/javaoff (0.00/0/0) ... {code} I enabled debug logging for PoolingHttpClientConnectionManager (used in 6.x) and PoolingClientConnectionManager (used in 5.x.x) and after running the indexing program and verifying that some connections are in CLOSE_WAIT, I grepped the logs for connections leased vs released and I always find the number to be the same which means that the connections are always given back to the pool. Now some connections hanging around in CLOSE_WAIT are to be expected because of the following (quoted from the httpclient documentation): {quote} One of the major shortcomings of the classic blocking I/O model is that the network socket can react to I/O events only when blocked in an I/O operation. When a connection is released back to the manager, it can be kept alive however it is unable to monitor the status of the socket and react to any I/O events. If the connection gets closed on the server side, the client side connection is unable to detect the change in the connection state (and react appropriately by closing the socket on its end). HttpClient tries to mitigate the problem by testing whether the connection is 'stale', that is no longer valid because it was closed on the server side, prior to using the connection for executing an HTTP request. The stale connection check is not 100% reliable. The only feasible solution that does not involve a one thread per socket model for idle connections is a dedicated monitor thread used to evict connections that are considered expired due to a long period of inactivity. The monitor thread can periodically call ClientConnectionManager#closeExpiredConnections() method to close all expired connections and evict closed connections from the pool. It can also optionally call ClientConnectionManager#closeIdleConnections() method to close all connections that have been idle over a given period of time. {quote} I'm going to try adding such a monitor thread and see if this is still a problem. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371758#comment-15371758 ] Hoss Man commented on SOLR-9290: questions specifically for [~shaie] followng up on comments made in the mailing list thread mentioned in the isue summary... {quote} When it does happen, the number of CLOSE_WAITS climb very high, to the order of 30K+ entries in 'netstat'. ... When I say it does not reproduce on 5.4.1 I really mean the numbers don't go as high as they do in 5.5.1. Meaning, when running without SSL, the number of CLOSE_WAITs is smallish, usually less than a 10 (I would separately like to understand why we have any in that state at all). When running with SSL and 5.4.1, they stay low at the order of hundreds the most. {quote} * Does this only reproduce in your application, with your customized configs of Solr, or can you reproduce it using something trivial like "modify bin/solr.in.sh to point at an SSL cert, then run; {{bin/solr -noprompt -cloud}}." ? * Does the problem only manifest solely with indexing, or with queries as well? ie... ** assuming a pre-built collection, and then all nodes restarted, does hammering the cluster with read only queries manifest the problem? ** assuming a virgin cluster with no docs, does hammering the cluster w/updates but never any queries, manifest the problem? * Assuming you start by bringing up a virgin cluster and then begin hammering it with whatever sequences of requests are needed to manifest the problem, how long do you have to wait before the number of CLOSE_WAITS spikes high enough that you are reasonably confident the problem has occured? The last question being a pre-req to wondering if we can just git bisect to identify where/when the problem originated. Even if writing a (reliable) bash automation script (to start the cluster, _and_ triggering requests, _and_ monitoring the CLOSE_WAITS to see if they go over a specified threshold in under a specified timelimit, _and_ shut everything down cleanly) is too cumbersome to have faith in running an automated {{git bisect run test.sh}}, we could still consider doing some manually driven git bisection to try and track this down, as long as each iteration doesn't take very long. Specifically: {{git merge-base}} says ffadf9715c4a511178183fc1411b18c1701b9f1d is the common ancestor for 5.4.1 and 5.5.1, and {{git log}} says there are 487 commits between that point and the 5.5.1 tag. Statistically speaking it should only take ~10 iterations to do a binary search of those commits to find the first problematic one. Assuming there is a manual process someone can run on a clean git checkout of 5.4.1 that takes under 10 minutes to get from "ant clean server" to an obvious splke in CLOSE_WAITS, someone with some CPU cycles to spare who doesn't mind a lot of context switching while they do their day job could be... # running a command to spin up the cluster & client hammering code # setting a 10 minute timer # when the timer goes off, check the results of another command to count the CLOSE_WAITS # {{git bisect good/bad}} # repeat ...and within ~2-3 hours should almost certainly have tracked down when/where the problem started. > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#633
[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15367313#comment-15367313 ] Johannes Meyer commented on SOLR-9290: -- We have the same issue on Solr 6.1.0 > TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled > -- > > Key: SOLR-9290 > URL: https://issues.apache.org/jira/browse/SOLR-9290 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 5.5.1, 5.5.2 >Reporter: Anshum Gupta >Priority: Critical > > Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT > state. > At my workplace, we have seen this issue only with 5.5.1 and could not > reproduce it with 5.4.1 but from my conversation with Shalin, he knows of > users with 5.3.1 running into this issue too. > Here's an excerpt from the email [~shaie] sent to the mailing list (about > what we see: > {quote} > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1 > 2) It does not reproduce when SSL is disabled > 3) Restarting the Solr process (sometimes both need to be restarted), the > count drops to 0, but if indexing continues, they climb up again > When it does happen, Solr seems stuck. The leader cannot talk to the > replica, or vice versa, the replica is usually put in DOWN state and > there's no way to fix it besides restarting the JVM. > {quote} > Here's the mail thread: > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E > Creating this issue so we could track this and have more people comment on > what they see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org