[ 
https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218286#comment-15218286
 ] 

Noble Paul commented on SOLR-8914:
----------------------------------

I looked at the patch. About this block:

{code:java}
    synchronized (refreshLiveNodesLock) {
      Set<String> newLiveNodes;
      try {
        List<String> nodeList = zkClient.getChildren(LIVE_NODES_ZKNODE, watcher, true);
        newLiveNodes = new HashSet<>(nodeList);
      } catch (KeeperException.NoNodeException e) {
        newLiveNodes = emptySet();
      }
      lastFetchedLiveNodes.set(newLiveNodes);
    }

    // Can't lock getUpdateLock() until we release the other, it would cause deadlock.
    Set<String> oldLiveNodes, newLiveNodes;
    synchronized (getUpdateLock()) {
      newLiveNodes = lastFetchedLiveNodes.getAndSet(null);
      if (newLiveNodes == null) {
        // Someone else won the race to apply the last update, just exit.
        return;
      }

      oldLiveNodes = this.liveNodes;
      this.liveNodes = newLiveNodes;
      if (clusterState != null) {
        clusterState.setLiveNodes(newLiveNodes);
      }
    }
{code}

Why can't the {{synchronized (refreshLiveNodesLock)}} block be applied to the 
entire method? That would make the code simpler and avoid the race condition 
altogether.
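
A minimal sketch of the simplification being asked about, under stated assumptions: {{LiveNodesHolder}} is a hypothetical stand-in for the relevant part of ZkStateReader, and the {{Supplier}} replaces the real {{zkClient.getChildren(LIVE_NODES_ZKNODE, watcher, true)}} call. It only illustrates that when the fetch and the publish happen under the same lock, the most recently fetched set is always the one that ends up published. It deliberately leaves out {{getUpdateLock()}}; whether that lock can safely be taken inside {{refreshLiveNodesLock}} without the deadlock mentioned in the patch comment is exactly the open question.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical stand-in for the live-nodes part of ZkStateReader.
class LiveNodesHolder {
  private final Object refreshLiveNodesLock = new Object();

  // Assumption: replaces the ZooKeeper getChildren call; returns null
  // where the real code would see KeeperException.NoNodeException.
  private final Supplier<List<String>> fetchLiveNodes;

  private volatile Set<String> liveNodes = Collections.<String>emptySet();

  LiveNodesHolder(Supplier<List<String>> fetchLiveNodes) {
    this.fetchLiveNodes = fetchLiveNodes;
  }

  void refreshLiveNodes() {
    // One lock around both steps: no other thread can publish an older
    // fetch after this thread has fetched a newer one.
    synchronized (refreshLiveNodesLock) {
      List<String> nodeList = fetchLiveNodes.get();
      Set<String> newLiveNodes = (nodeList == null)
          ? Collections.<String>emptySet()
          : new HashSet<String>(nodeList);
      this.liveNodes = newLiveNodes;
    }
  }

  Set<String> getLiveNodes() {
    return liveNodes;
  }
}
```

The trade-off is that the lock is now held for the duration of the ZooKeeper round trip, so concurrent refreshes are fully serialized.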

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: SOLR-8914.patch, SOLR-8914.patch, SOLR-8914.patch, 
> jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt, 
> live_node_mentions_port56361_with_threadIds.log.txt, 
> live_nodes_mentions.log.txt
>
>
> Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the 
> weekend...
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 
> (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a 
> MiniSolrCloudCluster & several clients are initialized -- before any code 
> related to TolerantUpdateProcessor is ever used.
> I can't reproduce this, or really make sense of what i'm (not) seeing here in 
> the logs, so i'm filing this jira with my analysis in the hopes that someone 
> else can help make sense of it.


