[ https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371758#comment-15371758 ]
Hoss Man commented on SOLR-9290:
--------------------------------

Questions specifically for [~shaie], following up on comments made in the mailing list thread mentioned in the issue summary...

{quote}
When it does happen, the number of CLOSE_WAITS climbs very high, on the order of 30K+ entries in 'netstat'. ... When I say it does not reproduce on 5.4.1, I really mean the numbers don't go as high as they do in 5.5.1. Meaning, when running without SSL, the number of CLOSE_WAITs is smallish, usually less than 10 (I would separately like to understand why we have any in that state at all). When running with SSL and 5.4.1, they stay low, on the order of hundreds at most.
{quote}

* Does this only reproduce in your application, with your customized configs of Solr, or can you reproduce it using something trivial like "modify bin/solr.in.sh to point at an SSL cert, then run {{bin/solr -noprompt -cloud}}"?
* Does the problem manifest solely with indexing, or with queries as well? i.e.:
** assuming a pre-built collection, and then all nodes restarted, does hammering the cluster with read-only queries manifest the problem?
** assuming a virgin cluster with no docs, does hammering the cluster with updates, but never any queries, manifest the problem?
* Assuming you start by bringing up a virgin cluster and then begin hammering it with whatever sequence of requests is needed to manifest the problem, how long do you have to wait before the number of CLOSE_WAITS spikes high enough that you are reasonably confident the problem has occurred?

The last question is a prerequisite to wondering if we can just git bisect to identify where/when the problem originated. Even if writing a (reliable) bash automation script (to start the cluster, _and_ trigger the requests, _and_ monitor the CLOSE_WAITS to see if they go over a specified threshold in under a specified time limit, _and_ shut everything down cleanly) is too cumbersome to have faith in running an automated {{git bisect run test.sh}}, we could still consider doing some manually driven git bisection to try and track this down, as long as each iteration doesn't take very long.

Specifically: {{git merge-base}} says ffadf9715c4a511178183fc1411b18c1701b9f1d is the common ancestor for 5.4.1 and 5.5.1, and {{git log}} says there are 487 commits between that point and the 5.5.1 tag. A binary search of those commits should only take ~10 iterations (log2 of 487 is roughly 9) to find the first problematic one. Assuming there is a manual process someone can run on a clean git checkout of 5.4.1 that takes under 10 minutes to get from "ant clean server" to an obvious spike in CLOSE_WAITS, someone with some CPU cycles to spare who doesn't mind a lot of context switching while they do their day job could be...

# running a command to spin up the cluster & the client hammering code
# setting a 10 minute timer
# when the timer goes off, checking the results of another command to count the CLOSE_WAITS
# {{git bisect good/bad}}
# repeat

...and within ~2-3 hours should almost certainly have tracked down when/where the problem started. Rough sketches of what those pieces might look like follow below.
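On the first question: a minimal sketch of the "trivial" SSL setup, assuming the stock {{bin/solr.in.sh}} that ships with 5.x (the SOLR_SSL_* variable names are the ones in that file; the keystore path and password here are placeholders):

{code:bash}
# bin/solr.in.sh -- minimal SSL enablement sketch; path and password are placeholders
SOLR_SSL_KEY_STORE=/path/to/solr-ssl.keystore.jks
SOLR_SSL_KEY_STORE_PASSWORD=secret
SOLR_SSL_TRUST_STORE=/path/to/solr-ssl.keystore.jks
SOLR_SSL_TRUST_STORE_PASSWORD=secret
{code}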
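The "count the CLOSE_WAITS" step is a one-liner; the second variant assumes the cloud example's default 898x ports so unrelated connections don't pollute the count:

{code:bash}
# total CLOSE_WAIT sockets on the box
netstat -an | grep -c CLOSE_WAIT

# scoped to the example cluster's ports (assumes the default 8983/8984/... layout)
netstat -an | grep CLOSE_WAIT | grep -c ':898[0-9]'
{code}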
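And for anyone who does want to gamble on full automation, a rough sketch of the {{git bisect run test.sh}} wiring; the hammer script, the threshold, the sleep times, and the tag name are all assumptions that would need tuning/verification:

{code:bash}
#!/bin/bash
# test.sh -- git bisect run keys off the exit code: 0 = good, non-zero = bad, 125 = skip
ant clean server || exit 125       # don't blame commits that simply fail to build

bin/solr -e cloud -noprompt        # spin up the example cluster
./hammer.sh &                      # placeholder: whatever client load reproduces the problem
HAMMER=$!

sleep 600                          # the hypothetical 10 minute window

COUNT=$(netstat -an | grep -c CLOSE_WAIT)
kill "$HAMMER" 2>/dev/null
bin/solr stop -all                 # shut everything down cleanly before the next iteration

echo "CLOSE_WAIT count: $COUNT"
test "$COUNT" -lt 1000             # threshold is a guess; tune to whatever "spike" means here
{code}

driven by:

{code:bash}
git bisect start
git bisect bad releases/lucene-solr/5.5.1     # tag name assumed; verify with 'git tag'
git bisect good ffadf9715c4a511178183fc1411b18c1701b9f1d
git bisect run ./test.sh
{code}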
> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-9290
>                 URL: https://issues.apache.org/jira/browse/SOLR-9290
>             Project: Solr
>          Issue Type: Bug
>   Security Level: Public (Default Security Level. Issues are Public)
>    Affects Versions: 5.5.1, 5.5.2
>            Reporter: Anshum Gupta
>            Priority: Critical
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT state.
> At my workplace, we have seen this issue only with 5.5.1 and could not reproduce it with 5.4.1, but from my conversation with Shalin, he knows of users with 5.3.1 running into this issue too.
> Here's an excerpt from the email [~shaie] sent to the mailing list (about what we see):
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the replica, or vice versa; the replica is usually put in DOWN state and there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread:
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E
> Creating this issue so we could track this and have more people comment on what they see.