[
https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202822#comment-14202822
]
James Hardwick commented on SOLR-6707:
--------------------------------------
Interesting clusterstate.json in ZK. Why would we have null range/parent
properties for an implicitly routed index that has never been split?
{code:javascript}
{
  "appindex":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"appindex",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search"},
          "core_node2":{
            "state":"active",
            "core":"appindex",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search",
            "leader":"true"},
          "core_node3":{
            "state":"active",
            "core":"appindex",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search"}}}},
    "router":{"name":"implicit"}},
  "app-analytics":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"recovery_failed",
            "core":"app-analytics",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search"},
          "core_node2":{
            "state":"recovery_failed",
            "core":"app-analytics",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search"},
          "core_node3":{
            "state":"down",
            "core":"app-analytics",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search",
            "leader":"true"}}}},
    "router":{"name":"implicit"}}}
{code}
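For reference, a minimal SolrJ (4.x) sketch along the lines below (the class name
and ZK address are placeholders) prints the same router/range information straight
from the live cluster state. As far as I can tell, the "implicit" router never
assigns hash ranges and "parent" is only populated on sub-shards created by a
split, so null values here may be expected rather than a symptom.
{code:java}
// Hypothetical sketch: print the router and per-shard range for both collections
// using SolrJ 4.x. "zk1:2181" stands in for the real ZK ensemble address.
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Slice;

public class PrintShardRanges {
  public static void main(String[] args) throws Exception {
    CloudSolrServer server = new CloudSolrServer("zk1:2181");
    server.connect(); // force the ZK state reader to initialize
    ClusterState state = server.getZkStateReader().getClusterState();
    for (String name : new String[] {"appindex", "app-analytics"}) {
      DocCollection coll = state.getCollection(name);
      System.out.println(name + " router=" + coll.getRouter().getClass().getSimpleName());
      for (Slice slice : coll.getSlices()) {
        // With the implicit router, getRange() is expected to return null.
        System.out.println("  " + slice.getName() + " range=" + slice.getRange());
      }
    }
    server.shutdown();
  }
}
{code}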
> Recovery/election for invalid core results in rapid-fire re-attempts until
> /overseer/queue is clogged
> -----------------------------------------------------------------------------------------------------
>
> Key: SOLR-6707
> URL: https://issues.apache.org/jira/browse/SOLR-6707
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.10
> Reporter: James Hardwick
>
> We experienced an issue the other day that brought a production Solr server
> down, and this is what we found after investigating:
> - A running Solr instance with two separate cores, one of which is perpetually
> down because its configs are not yet fully updated for SolrCloud. This was
> thought to be harmless since it is not currently in use.
> - Solr experienced an "internal server error" supposedly because of "No space
> left on device" even though we appeared to have ~10GB free.
> - Solr immediately went into recovery, with subsequent leader elections for
> each shard of each core.
> - Our primary core recovered immediately. Our additional core, which was never
> active in the first place, attempted to recover but of course couldn't due to
> the improper configs.
> - Solr then began rapid-fire re-attempts at recovering said core, trying maybe
> 20-30 times per second.
> - This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
> - At some point /overseer/queue became so backed up that normal cluster
> coordination could no longer play out, and Solr toppled over. (A quick way to
> watch the queue depth is sketched just below this list.)
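> A minimal sketch of how the queue depth could be watched while this is
> happening, assuming the plain ZooKeeper Java client and a placeholder
> ensemble address:
> {code:java}
> // Hypothetical: count the children of /overseer/queue once per second.
> import org.apache.zookeeper.WatchedEvent;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
>
> public class OverseerQueueDepth {
>   public static void main(String[] args) throws Exception {
>     ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, new Watcher() {
>       public void process(WatchedEvent event) { /* no-op */ }
>     });
>     while (true) {
>       int depth = zk.getChildren("/overseer/queue", false).size();
>       System.out.println("/overseer/queue depth: " + depth);
>       Thread.sleep(1000);
>     }
>   }
> }
> {code}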
> I know this is a bit of an unusual circumstance due to us keeping the dead
> core around, and our quick solution has been to remove said core. However, I
> can see other potential scenarios that might cause the same issue to arise.