Chris M. Hostetter created SOLR-14452:
-----------------------------------------

             Summary: "classloading deadlock" issue with DocSet/SortedIntDocSet
                 Key: SOLR-14452
                 URL: https://issues.apache.org/jira/browse/SOLR-14452
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Chris M. Hostetter


While beasting some facet related cloud tests on master, I noticed a pattern of 
occasional failures that seemed to crop up...
 * the test ultimately fails due to a timeout (usually the client threads time 
out waiting for a server response)
 * if I notice my CPU isn't spinning very hard _before_ the test fails, I can 
capture a jstack and inspect some threads
 * there will be multiple jetty/solr request threads (ex: {{"qtp82184175-145"}}) 
whose stack traces show various stages of DocSet collection, and which are 
reported as {{"... in Object.wait()"}} but also {{RUNNABLE}}

...this isn't a thread summary+state combination that I'm used to seeing when 
looking at thread dumps, and some research into when/why this might happen led 
me to:
 * [https://stackoverflow.com/questions/28631656/runnable-thread-state-but-in-object-wait]
 ** [https://stackoverflow.com/a/28776438/689372]
 *** [http://ternarysearch.blogspot.com/2013/07/static-initialization-deadlock.html]
 *** [https://bugs.openjdk.java.net/browse/JDK-8037567]

...while the comments/status of JDK-8037567 suggest "nothing wrong here", the 
overall symptoms/description of the problem in the SO answer and linked blog, 
and their conclusion that this is essentially a "deadlock" situation in the 
class loader, do seem to correlate with some of the specifics I can see in the 
stack traces when this happens while running Solr tests...
 * at least one "RUNNABLE / Object.wait" thread trying to do class init; class: 
DocSet...
{noformat}
"qtp1535326437-68" #68 prio=5 os_prio=0 cpu=72.48ms elapsed=241.69s tid=0x00007fc08c0a4000 nid=0x864 in Object.wait()  [0x00007fc0adedd000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.solr.search.DocSet.<clinit>(DocSet.java:118)
        at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:90) // "new BitDocSet(..)"
        at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
        at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
{noformat}

 * other "RUNNABLE / Object.wait" threads are on lines that involve 
instantiating a subclass of DocSet:
 ** 
{noformat}
"qtp1535326437-67" #67 prio=5 os_prio=0 cpu=801.44ms elapsed=241.69s tid=0x00007fc08c0a1800 nid=0x863 in Object.wait()  [0x00007fc0adfdf000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:90) // "new BitDocSet(..)"
        at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
        at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
{noformat}

 ** 
{noformat}
"qtp82184175-65" #65 prio=5 os_prio=0 cpu=137.76ms elapsed=241.69s tid=0x00007fc088092000 nid=0x860 in Object.wait()  [0x00007fc0ae2e2000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:84) // "new SortedIntDocSet(..)"
        at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
        at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
        at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1433)
{noformat}

 ** etc...
 * DocSet has a static reference to a concrete subclass...
 ** {{public static final DocSet EMPTY = new SortedIntDocSet(new int[0], 0);}}
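
The cycle suggested by these traces can be sketched in isolation. This is only a minimal sketch under stated assumptions, not Solr code: {{Base}} and {{Sub}} are hypothetical stand-ins for DocSet and SortedIntDocSet, and the latch plus the 500ms pause exist purely to force the unlucky thread interleaving that would otherwise happen only occasionally. One thread starts initializing {{Base}} (whose static initializer needs {{Sub}}), while a second thread starts initializing {{Sub}} (which must first initialize its superclass {{Base}}), so each ends up waiting on the class the other one is initializing.

```java
import java.util.concurrent.CountDownLatch;

public class ClassInitDeadlockSketch {

    // Counted down once thread A is inside Base's static initializer.
    static final CountDownLatch baseInitStarted = new CountDownLatch(1);

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Hypothetical stand-in for DocSet: a static field holding a subclass instance.
    static class Base {
        static final Base EMPTY;
        static {
            baseInitStarted.countDown();
            pause(500);        // give thread B time to begin initializing Sub
            EMPTY = new Sub(); // requires Sub to be initialized
        }
    }

    // Hypothetical stand-in for SortedIntDocSet: initializing it first
    // requires its superclass Base to be initialized.
    static class Sub extends Base {
    }

    /** Returns true if both threads are still stuck after the join timeouts. */
    public static boolean reproduce() {
        // Thread A: first active use of Base, so it runs Base's static initializer.
        Thread a = new Thread(() -> Base.EMPTY.hashCode(), "init-Base");
        // Thread B: waits until A is inside Base's initializer, then triggers Sub's init.
        Thread b = new Thread(() -> {
            try {
                baseInitStarted.await();
            } catch (InterruptedException e) {
                return;
            }
            new Sub();
        }, "init-Sub");
        // Daemon threads, so the stuck threads do not keep the JVM alive.
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();
        try {
            a.join(3000);
            b.join(3000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return a.isAlive() && b.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both threads stuck in class init: " + reproduce());
    }
}
```

Taking a jstack of this process while {{reproduce()}} is blocked in the joins is one way to compare the resulting thread summaries against the dumps above; how the waiting threads are labelled may differ between JVM versions.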

----

I should point out:
* While this particular "class loading deadlock" issue seems more likely to 
happen in a "test" situation where the JVMs/classloaders are short-lived, 
there's no reason to assume this type of failure couldn't happen in a 
production Solr instance when handling a burst of queries right after startup.
* This type of failure (either specifically due to "DocSet vs SortedIntDocSet", 
or due to similar patterns in other classes) may also be the root cause of 
various other hard-to-reproduce "timed out" test failures we've seen over the 
years.


