[
https://issues.apache.org/jira/browse/SOLR-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533820#comment-17533820
]
Michael Gibney commented on SOLR-15660:
---------------------------------------
Thanks for your response, Mark! I'm not sure I understand what you're
suggesting as your preferred course of action though.
For my own part, my main concern is actually not "most devs" (for whatever
reason this doesn't actually present much of a practical issue for me in local
development), but rather the automated builds, which fail several times a day
with a stacktrace on, e.g.
{{java.util.concurrent.ThreadPoolExecutor.tryTerminate}} (Btw apologies for
spamming the list while I edited the above comment until I figured out that I
had to percent-encode the pipe character in the lists.apache.org url; the
relevant link is
[this|https://lists.apache.org/[email protected]:dfr=2022-4-1%7Cdto=2022-5-31:[email protected]/java.util.concurrent.ThreadPoolExecutor.tryTerminate]).
To take the stacktrace discussed above blocked in
{{ZkStateReader.createClusterStateWatchersAndUpdate}}: I actually think the
threads are neither deadlocked _nor_ waiting on something slow that holds the
lock. The leaked threads in some (most?) cases -- 1 from a build, and 1 that I
was able to reproduce locally -- were all identical, blocked on that
synchronized method. I think they just got caught by thread leak detection in
the fraction of a second between test method return and when they would
naturally have died, with no ill effect.
This is a good case in fact to illustrate the tradeoff, because it _was_ a
threadleak, and arguably would indeed be better off fixed. (Again, I'm fairly
certain the problem is that {{ZooKeeper.close()}} does not wait for connection
threads to terminate; fixed with
[apache/solr#842|https://github.com/apache/solr/pull/842]). I'm all for
catching and fixing such cases, even if I also think the practical consequence
(at least in this case) is likely insignificant. (caveat/tangent: introducing
blocking ZooKeeper.close() in automatic Zk reconnection attempts seems to make
a significant difference under the hood, sometimes blocking for upwards of 10s,
with no ill effect on the overall test suite. It's possible that the change
could be significant (and perhaps beneficial, in ways that I don't yet
understand?), even if I truly think it is practically insignificant in the
end-of-test "ThreadLeakDetection" context where it was initially detected).
But there is literally no way to "fix" the {{tryTerminate}} case (link to
builds mentioned above) -- it's not broken. That was the point of the link to
the MAHOUT issue. And the negative practical consequences of being so strict by
default are significant:
# people tune out build failures because there are enough failures (many of the
classMethod-level, inscrutable failures) that the noise drowns out more
meaningful signals
# developers spend time troubleshooting issues that end up being either
unfixable/non-issue ({{tryTerminate}}) or of questionable practical
significance (async {{ZooKeeper.close()}} connection thread death).
# at a "meta" level, people eventually come to the conclusion that "tests are
flaky", to the detriment of the project's reputation, externally and internally
I guess as a way forward I'd suggest: sure, 10s is probably too long. But
something shorter (1s? 250ms?) probably makes sense as a default. It would be
great if there were a way to have this be configurable by project properties.
I'd be really curious to know, if we zeroed out ThreadLeakLinger across the
board, how many of the resulting issues would end up being of the completely
spurious, {{tryTerminate}}, variety. I'm sure I'm missing something, but in
principle one could even inspect leaked thread stack traces and automatically
prune false positives?
> Remove universal 10 second test thread leak linger.
> ---------------------------------------------------
>
> Key: SOLR-15660
> URL: https://issues.apache.org/jira/browse/SOLR-15660
> Project: Solr
> Issue Type: Test
> Components: Tests
> Reporter: Mark Robert Miller
> Assignee: Mark Robert Miller
> Priority: Minor
> Fix For: 9.0
>
> Attachments: screenshot-1.png
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]