[
https://issues.apache.org/jira/browse/SOLR-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200061#comment-16200061
]
Shalin Shekhar Mangar commented on SOLR-11445:
----------------------------------------------
I think it is better to check explicitly for NoNode or NodeExists
exceptions in the isBadMessageOrInvalidState() method. Most other
KeeperExceptions shouldn't cause us to poll items off the queue. Also, the same
kind of handling should be applied to exceptions thrown when processing messages
from the state update queue.
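
To make the suggestion concrete, here is a minimal sketch of such an explicit check. It is not the code from the attached patch. To stay self-contained it uses a stand-in exception type (KeeperLikeException) and a stand-in Code enum rather than the real org.apache.zookeeper.KeeperException, and it assumes that any non-Keeper exception is not treated as a bad message. That last choice is an assumption of the sketch, not something stated in this thread.

```java
public class BadStateCheckSketch {

    /** Stand-in for the few ZooKeeper error codes this sketch cares about (hypothetical type). */
    public enum Code { NONODE, NODEEXISTS, CONNECTIONLOSS, SESSIONEXPIRED }

    /** Stand-in for org.apache.zookeeper.KeeperException, which also exposes an error code. */
    public static class KeeperLikeException extends Exception {
        private final Code code;

        public KeeperLikeException(Code code) {
            super("KeeperErrorCode = " + code);
            this.code = code;
        }

        public Code code() {
            return code;
        }
    }

    /**
     * Returns true only for the two error codes named in the comment above.
     * Every other Keeper-style error (for example connection loss) returns false,
     * so the caller would keep the item on the queue and retry later.
     */
    public static boolean isBadMessageOrInvalidState(Exception e) {
        if (e instanceof KeeperLikeException) {
            Code code = ((KeeperLikeException) e).code();
            return code == Code.NONODE || code == Code.NODEEXISTS;
        }
        // Assumption of this sketch: anything that is not a Keeper-style
        // exception is not classified as a bad message.
        return false;
    }
}
```

With the real ZooKeeper client the same comparison would be made against KeeperException.Code.NONODE and KeeperException.Code.NODEEXISTS via e.code().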
> Overseer.processQueueItem().... zkStateWriter.enqueueUpdate might ideally
> have a try{}catch{} around it
> --------------------------------------------------------------------------------------------------------
>
> Key: SOLR-11445
> URL: https://issues.apache.org/jira/browse/SOLR-11445
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Affects Versions: 6.6.1, 7.0, master (8.0)
> Reporter: Greg Harris
> Attachments: SOLR-11445.patch
>
>
> So we had the following stack trace with a customer:
> 2017-10-04 11:25:30.339 ERROR (xxxx) [ ] o.a.s.c.Overseer Exception in Overseer main queue loop
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/xxxx/state.json
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
> at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
> at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
> at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
> at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
> at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:235)
> at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:152)
> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:271)
> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:199)
> at java.lang.Thread.run(Thread.java:748)
> I want to highlight:
> at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:152)
> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:271)
> This ends up coming from Overseer:
> while (data != null) {
>   final ZkNodeProps message = ZkNodeProps.load(data);
>   log.debug("processMessage: workQueueSize: {}, message = {}",
>       workQueue.getStats().getQueueLength(), message);
>   // force flush to ZK after each message because there is no fallback if workQueue items
>   // are removed from workQueue but fail to be written to ZK
>   clusterState = processQueueItem(message, clusterState, zkStateWriter, false, null);
>   workQueue.poll(); // poll-ing removes the element we got by peek-ing
>   data = workQueue.peek();
> }
> Note: processQueueItem is called before the poll, so when it throws an
> exception the node/message that cannot be processed is never removed and
> stays stuck at the head of the queue. This left a large cluster unable to
> come up on its own without deleting the problem node.
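
The stuck-message behaviour described in the note, and one way a try/catch could avoid it, can be illustrated with a small standalone sketch. It is not the Solr code or the attached patch. The Processor interface, the drain method, and the use of a plain java.util.Queue of strings are all hypothetical stand-ins for processQueueItem and the Overseer work queue. The caller supplies the predicate that decides which exceptions mark a message as bad.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Predicate;

public class PoisonMessageLoopSketch {

    /** Hypothetical stand-in for the processQueueItem step. */
    public interface Processor {
        void process(String message) throws Exception;
    }

    /**
     * Peeks, processes, then polls, as in the loop quoted above.
     * If processing fails and the predicate classifies the exception as a bad
     * message, the message is recorded and still polled so the loop can move on.
     * Any other exception is rethrown and the message stays at the head of the queue.
     */
    public static List<String> drain(Queue<String> workQueue,
                                     Processor processor,
                                     Predicate<Exception> isBadMessage) throws Exception {
        List<String> skipped = new ArrayList<>();
        String data = workQueue.peek();
        while (data != null) {
            try {
                processor.process(data);
            } catch (Exception e) {
                if (!isBadMessage.test(e)) {
                    throw e; // not polled: the same message is retried on the next run
                }
                skipped.add(data); // remember what was dropped so it can be logged
            }
            workQueue.poll(); // removes the element obtained by peek
            data = workQueue.peek();
        }
        return skipped;
    }
}
```

Without the catch block, a message that always fails would be peeked again on every run and nothing behind it would ever be processed, which matches the behaviour reported above.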