[
https://issues.apache.org/jira/browse/SOLR-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188326#comment-14188326
]
Mark Miller commented on SOLR-6631:
-----------------------------------
bq. if (eventType == Event.EventType.NodeChildrenChanged) {
+1 - we are only interested in waiting around to see a child added - this
watcher should not need to consider other events.
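
For illustration only, here is a minimal sketch of that guard. It is not the
attached patch: EventType and WatchedEvent below are hypothetical stand-ins
for the ZooKeeper classes of the same name so the example compiles without
the ZooKeeper jar, and the LatchChildWatcher shown is an assumed
simplification of the one in DistributedQueue.

```java
import java.util.concurrent.TimeUnit;

public class LatchChildWatcherSketch {

    // Hypothetical stand-in for org.apache.zookeeper.Watcher.Event.EventType.
    enum EventType { None, NodeCreated, NodeDeleted, NodeDataChanged, NodeChildrenChanged }

    // Hypothetical stand-in for org.apache.zookeeper.WatchedEvent.
    static final class WatchedEvent {
        private final EventType type;
        private final String path;
        WatchedEvent(EventType type, String path) { this.type = type; this.path = path; }
        EventType getType() { return type; }
        String getPath() { return path; }
    }

    // Assumed simplification of DistributedQueue's LatchChildWatcher.
    static final class LatchChildWatcher {
        private final Object lock = new Object();
        private WatchedEvent event; // guarded by lock

        void process(WatchedEvent e) {
            // Only a change in the child list is recorded; connection-state
            // (None) and other events leave the member event untouched.
            if (e.getType() != EventType.NodeChildrenChanged) {
                return;
            }
            synchronized (lock) {
                event = e;
                lock.notifyAll();
            }
        }

        // Blocks until a child event has been recorded or the timeout elapses.
        void await(long timeoutMillis) throws InterruptedException {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            synchronized (lock) {
                while (event == null) {
                    long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                    if (remainingMs <= 0) {
                        return;
                    }
                    lock.wait(remainingMs);
                }
            }
        }

        WatchedEvent getWatchedEvent() {
            synchronized (lock) {
                return event;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LatchChildWatcher watcher = new LatchChildWatcher();

        // A None event (e.g. a connection-state change) is ignored.
        watcher.process(new WatchedEvent(EventType.None, null));
        System.out.println(watcher.getWatchedEvent() == null); // prints "true"

        // A child change is recorded and would wake a thread blocked in await.
        watcher.process(new WatchedEvent(EventType.NodeChildrenChanged, "/queue"));
        System.out.println(watcher.getWatchedEvent().getType()); // prints "NodeChildrenChanged"
    }
}
```

With this guard, an unrelated event no longer leaves the watcher holding a
non-null event, so a caller looping on await keeps blocking until a child
actually changes or the timeout expires.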
> DistributedQueue spinning on calling zookeeper getChildren()
> ------------------------------------------------------------
>
> Key: SOLR-6631
> URL: https://issues.apache.org/jira/browse/SOLR-6631
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Reporter: Jessica Cheng Mallet
> Assignee: Timothy Potter
> Labels: solrcloud
> Attachments: SOLR-6631.patch
>
>
> The change from SOLR-6336 introduced a bug: I'm now stuck in a loop
> making getChildren() requests to ZooKeeper, with this thread dump:
> {quote}
> Thread-51 [WAITING] CPU time: 1d 15h 0m 57s
> java.lang.Object.wait()
> org.apache.zookeeper.ClientCnxn.submitRequest(RequestHeader, Record, Record, ZooKeeper$WatchRegistration)
> org.apache.zookeeper.ZooKeeper.getChildren(String, Watcher)
> org.apache.solr.common.cloud.SolrZkClient$6.execute()<2 recursive calls>
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkOperation)
> org.apache.solr.common.cloud.SolrZkClient.getChildren(String, Watcher, boolean)
> org.apache.solr.cloud.DistributedQueue.orderedChildren(Watcher)
> org.apache.solr.cloud.DistributedQueue.getChildren(long)
> org.apache.solr.cloud.DistributedQueue.peek(long)
> org.apache.solr.cloud.DistributedQueue.peek(boolean)
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run()
> java.lang.Thread.run()
> {quote}
> Looking at the code, I think the issue is that LatchChildWatcher#process
> always assigns the incoming event to its member variable event, regardless
> of the event's type. Once that member is set, await no longer waits. In
> this state, the while loop in getChildren(long), when called with wait
> being Long.MAX_VALUE, loops back and does NOT block in await because
> event != null, yet it still does not get any children.
> {quote}
> while (true) {
>   if (!children.isEmpty()) break;
>   watcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
>   if (watcher.getWatchedEvent() != null) {
>     children = orderedChildren(null);
>   }
>   if (wait != Long.MAX_VALUE) break;
> }
> {quote}
> I think the fix would be to only set the event in the watcher if the type is
> not None.