James Hardwick created SOLR-6707:
------------------------------------
Summary: Recovery/election for invalid core results in rapid-fire
re-attempts until /overseer/queue is clogged
Key: SOLR-6707
URL: https://issues.apache.org/jira/browse/SOLR-6707
Project: Solr
Issue Type: Bug
Affects Versions: 4.10
Reporter: James Hardwick
We experienced an issue the other day that brought a production Solr server
down, and this is what we found after investigating:
- Running Solr instance with two separate cores, one of which is perpetually
down because its configs are not yet completely updated for SolrCloud. This
was thought to be harmless since it's not currently in use.
- Solr experienced an "internal server error", I believe due in part to a fairly
new feature we are using, which seemingly caused all cores to go down.
- Solr immediately went into recovery, and subsequent leader election for each
shard of each core.
- Our primary core recovered immediately. Our additional core, which was never
active in the first place, attempted to recover but of course couldn't due to
the improper configs.
- Solr then began rapid-fire reattempting recovery of said node, trying maybe
20-30 times per second.
- This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
- At some point /overseer/queue became so backed up that normal cluster
coordination could no longer play out, and Solr toppled over.
I know this is a bit of an unusual circumstance due to us keeping the dead core
around, and our quick solution has been to remove said core. However, I can see
other potential scenarios that might cause the same issue to arise.
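
To make the failure mode above concrete, here is a rough back-of-the-envelope sketch, not Solr code. It assumes each failed recovery attempt enqueues one message on /overseer/queue and that the Overseer drains the queue at a fixed rate; both are assumptions for illustration, as are the function names and the drain rate of 10 messages per second. The retry rate of 25 per second is the midpoint of the 20-30 per second we observed. The second function shows how few attempts a capped exponential backoff would produce in the same window; that is a hypothetical mitigation, not something Solr 4.10 is known to do here.

```python
def backlog_after(seconds, enqueue_per_sec, drain_per_sec):
    """Return the queue depth after `seconds`, simulated one second at a time.

    Assumes one queue message per retry and a constant drain rate
    (both are illustrative assumptions, not measured Solr behavior).
    """
    depth = 0
    for _ in range(seconds):
        # The queue cannot go below empty, so clamp at zero.
        depth = max(0, depth + enqueue_per_sec - drain_per_sec)
    return depth


def attempts_with_backoff(seconds, initial_delay=1.0, max_delay=60.0):
    """Count retries that fit in `seconds` when the delay doubles up to a cap.

    Hypothetical mitigation: wait `initial_delay` before the first retry,
    then double the wait after every failure, never exceeding `max_delay`.
    """
    elapsed, delay, attempts = 0.0, initial_delay, 0
    while elapsed + delay <= seconds:
        elapsed += delay
        attempts += 1
        delay = min(delay * 2, max_delay)
    return attempts


if __name__ == "__main__":
    # Rapid-fire retries: 25 messages/s in, assumed 10 messages/s drained.
    print("backlog after 60 s:", backlog_after(60, 25, 10))
    # Same 60 s window with capped exponential backoff.
    print("attempts with backoff in 60 s:", attempts_with_backoff(60))
```

Under these assumptions the backlog grows without bound for as long as the retry rate exceeds the drain rate, which matches what we saw: the queue never gets a chance to empty once the dead core starts retrying.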