[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hoss Man updated SOLR-8914:
---------------------------
    Attachment: SOLR-8914.patch

I wrote up a stress test to demonstrate the bug, and added it to the patch Scott had already worked up & attached.

Scott: Prior to incorporating your changes, hammering on this stress test would fail within the first 20 attempts. But with your changes I'm seeing deadlocks within the first 5 attempts every time I hammer on it...

{noformat}
Found one Java-level deadlock:
=============================
"zkCallback-7-thread-2-processing-n:127.0.0.1:48312_solr":
  waiting to lock monitor 0x00007f82d40076b8 (object 0x00000000ff3b5b38, a java.lang.Object),
  which is held by "zkCallback-7-thread-1-processing-n:127.0.0.1:48312_solr"
"zkCallback-7-thread-1-processing-n:127.0.0.1:48312_solr":
  waiting to lock monitor 0x00007f82d400be38 (object 0x00000000ff3b5800, a org.apache.solr.common.cloud.ZkStateReader),
  which is held by "OverseerStateUpdate-95637266046386179-127.0.0.1:48312_solr-n_0000000000"
"OverseerStateUpdate-95637266046386179-127.0.0.1:48312_solr-n_0000000000":
  waiting to lock monitor 0x00007f82d40076b8 (object 0x00000000ff3b5b38, a java.lang.Object),
  which is held by "zkCallback-7-thread-1-processing-n:127.0.0.1:48312_solr"
{noformat}

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: SOLR-8914.patch, SOLR-8914.patch,
>                      jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
>                      live_node_mentions_port56361_with_threadIds.log.txt,
>                      live_nodes_mentions.log.txt
>
>
> Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the weekend....
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a
> MiniSolrCloudCluster & several clients are initialized -- before any code
> related to TolerantUpdateProcessor is ever used.
> I can't reproduce this, or really make sense of what I'm (not) seeing here in
> the logs, so I'm filing this jira with my analysis in the hopes that someone
> else can help make sense of it.
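
For anyone reading the deadlock report in the comment above: the cycle is between the thread holding the java.lang.Object monitor while waiting for the ZkStateReader monitor, and the thread holding the ZkStateReader monitor while waiting for that same Object. The following is only a minimal sketch of that lock-order inversion, not Solr code and not the stress test in the patch. The class name, thread names, and the two lock objects are hypothetical stand-ins, and the detection uses the JDK's ThreadMXBean rather than anything from the Solr test framework.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only; none of these names exist in Solr.
final class LockOrderDeadlockSketch {

    /**
     * Starts two daemon threads that take two monitors in opposite order,
     * then polls the JVM until it reports a monitor deadlock.
     * Returns the number of threads reported, or 0 if none was seen in time.
     */
    static int provokeAndCount() {
        final Object readerMonitor = new Object(); // stands in for the ZkStateReader instance
        final Object updateLock = new Object();    // stands in for the plain java.lang.Object lock
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        Thread callback = new Thread(() -> {
            synchronized (updateLock) {
                arriveAndWait(bothHoldFirstLock);
                synchronized (readerMonitor) {
                    // never reached: the other thread holds readerMonitor
                }
            }
        }, "sketch-zkCallback");

        Thread overseer = new Thread(() -> {
            synchronized (readerMonitor) {
                arriveAndWait(bothHoldFirstLock);
                synchronized (updateLock) {
                    // never reached: the other thread holds updateLock
                }
            }
        }, "sketch-overseer");

        // Daemon threads, so the stuck pair does not keep the JVM alive.
        callback.setDaemon(true);
        overseer.setDaemon(true);
        callback.start();
        overseer.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        try {
            for (int i = 0; i < 100; i++) {
                long[] ids = mx.findMonitorDeadlockedThreads(); // null when no deadlock
                if (ids != null) {
                    return ids.length;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 0;
    }

    // Each thread signals that it holds its first lock, then waits for the other.
    private static void arriveAndWait(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads reported: " + provokeAndCount());
    }
}
```

The latch is there only to make the inversion deterministic; in the real failure the interleaving depends on timing, which would be consistent with it showing up only under repeated stress runs.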