[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287549#comment-14287549 ]
Shalin Shekhar Mangar commented on SOLR-6707:
---------------------------------------------
This happens because the 'requestrecovery' command sent from the leader
succeeds even when the replica has run out of disk space. The leader then puts
that replica back into rotation, which fails, and the cycle keeps repeating.
We need better handling of such exceptions on the leader code path.
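
To illustrate one way the repeat cycle could be damped, here is a minimal
sketch of capped exponential backoff with a hard attempt limit. This is not
the fix applied in Solr and it uses no Solr APIs: the class RecoveryBackoff,
its parameters, and the attempt/sleeper callbacks are all hypothetical names
chosen for this example.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

/** Hypothetical helper (not part of Solr): spaces out and caps recovery attempts. */
final class RecoveryBackoff {
    private final long baseDelayMs;
    private final long maxDelayMs;
    private final int maxAttempts;

    RecoveryBackoff(long baseDelayMs, long maxDelayMs, int maxAttempts) {
        if (baseDelayMs <= 0 || maxDelayMs < baseDelayMs || maxAttempts < 1) {
            throw new IllegalArgumentException("invalid backoff settings");
        }
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.maxAttempts = maxAttempts;
    }

    /** Delay after the given failed attempt (0-based): base * 2^attempt, capped. */
    long delayForAttempt(int attempt) {
        long delay = baseDelayMs;
        // Double step by step and stop at the cap, so the value cannot overflow.
        for (int i = 0; i < attempt && delay < maxDelayMs; i++) {
            delay = (delay > maxDelayMs / 2) ? maxDelayMs : delay * 2;
        }
        return Math.min(delay, maxDelayMs);
    }

    /**
     * Calls attempt until it returns true or maxAttempts is reached.
     * The sleeper receives the wait in milliseconds between failed attempts.
     * Returns true on success, false if the caller should give up.
     */
    boolean run(BooleanSupplier attempt, LongConsumer sleeper) {
        for (int i = 0; i < maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            if (i < maxAttempts - 1) {
                sleeper.accept(delayForAttempt(i));
            }
        }
        return false; // give up instead of retrying forever
    }
}
```

With a base delay of 1000 ms, a cap of 8000 ms, and 5 attempts, a replica that
never recovers would be retried after waits of 1000, 2000, 4000, and 8000 ms
and then left alone, rather than being retried 20-30 times per second.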
> Recovery/election for invalid core results in rapid-fire re-attempts until
> /overseer/queue is clogged
> -----------------------------------------------------------------------------------------------------
>
> Key: SOLR-6707
> URL: https://issues.apache.org/jira/browse/SOLR-6707
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.10
> Reporter: James Hardwick
>
> We experienced an issue the other day that brought a production Solr server
> down, and this is what we found after investigating:
> - Running Solr instance with two separate cores, one of which is perpetually
> down because its configs are not yet completely updated for SolrCloud. This
> was thought to be harmless since it's not currently in use.
> - Solr experienced an "internal server error" supposedly because of "No space
> left on device" even though we appeared to have ~10GB free.
> - Solr immediately went into recovery, followed by leader election for
> each shard of each core.
> - Our primary core recovered immediately. Our additional core, which was never
> active in the first place, attempted to recover but of course couldn't due to
> the improper configs.
> - Solr then began rapid-fire reattempting recovery of said node, trying maybe
> 20-30 times per second.
> - This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
> - At some point /overseer/queue becomes so backed up that normal cluster
> coordination can no longer play out, and Solr topples over.
> I know this is a bit of an unusual circumstance due to us keeping the dead
> core around, and our quick solution has been to remove said core. However, I
> can see other potential scenarios that might cause the same issue to arise.