[
https://issues.apache.org/jira/browse/SOLR-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695492#comment-15695492
]
Jeremy Hoy commented on SOLR-9706:
----------------------------------
We have a setup with separate indexing and searching clusters. We do this
using both SolrCloud (primarily for searching) and master/slave replication.
It's worth noting that we ingest into named shards rather than letting
SolrCloud deal with sharding. This is achieved using the normal enable.slave
and enable.master init arguments, together with solrcloud.skip.autorecovery
and the NoOpDistributingUpdateProcessorFactory in the
updateRequestProcessorChain in solrconfig. The problem described here bit us
because the searching slaves were usually blocked when partial replication
started and stayed blocked until the new segments had been downloaded. This
was made worse for us because we monitor that Solr is responding to admin
ping requests and restart Solr if that fails or times out a number of times
in succession, which is exactly what was happening!
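For context, the relevant bits of our setup look roughly like the following
solrconfig.xml fragments. This is a sketch only: the chain name, masterUrl,
and pollInterval are illustrative, and solrcloud.skip.autorecovery is passed
as a JVM system property at startup rather than configured here.

    <!-- Keep documents in the core they were sent to: disable
         SolrCloud's distributed update processing. -->
    <updateRequestProcessorChain name="no-distrib" default="true">
      <processor class="solr.NoOpDistributingUpdateProcessorFactory"/>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

    <!-- Old-style replication, toggled per cluster with
         -Denable.master=true / -Denable.slave=true. -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="enable">${enable.master:false}</str>
        <str name="replicateAfter">commit</str>
      </lst>
      <lst name="slave">
        <str name="enable">${enable.slave:false}</str>
        <str name="masterUrl">http://indexhost:8983/solr/mycore</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>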
The reason the blocking happens is that in IndexFetcher.fetchLatestIndex the
searcher is closed if we're running SolrCloud (which in our scenario we are,
kind of!) and the replication is partial. A new index writer is then
created, which takes the index writer write lock; that in turn blocks the
creation of a new searcher, so any subsequent requests are blocked until the
write lock is released when the replication completes. This behavior was
introduced as part of SOLR-6640, so it is intentional in the sense that the
searcher is (must be) closed to prevent uncommitted/flushed files resulting
in index corruption.
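To make the sequence concrete, the partial-replication path has roughly the
following shape (a paraphrase of the 6.x code, not the actual source; the
method names match the stack traces quoted below):

    // Paraphrase of IndexFetcher.fetchLatestIndex (Solr 6.x), partial
    // replication only; solrCloudMode/isFullCopyNeeded as described above.
    if (solrCloudMode && !isFullCopyNeeded) {
      // Close the writer (with rollback) so partially downloaded,
      // uncommitted segment files cannot corrupt the index.
      solrCore.getSolrCoreState().closeIndexWriter(solrCore, true);
    }
    try {
      downloadIndexFiles(/* ... */);  // can take minutes on a slow network
    } finally {
      if (solrCloudMode && !isFullCopyNeeded) {
        // Re-creating the writer takes the write lock in
        // DefaultSolrCoreState. SolrCore.openNewSearcher() calls
        // getIndexWriter(), which waits on that same lock, so any query
        // that needs a new searcher blocks until this call completes.
        solrCore.getSolrCoreState().openIndexWriter(solrCore);
      }
    }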
For our situation, then, an obvious way to fix this is to check the init
args to see whether enable.slave is set and, if so, not close the searcher.
We could do the check using a method in IndexFetcher, by setting a private
field in the IndexFetcher constructor, or possibly by adding a public getter
for isSlave in the ReplicationHandler, or something else. I'm not sure what
the best approach is, but I'm happy to put a patch together if you have a
preference; roughly, I'm thinking of something like the sketch below.
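As a rough illustration only (the getter and the extra guard are
hypothetical, not existing API):

    // In ReplicationHandler: expose the slave flag it already tracks
    // from the "slave" init section.
    public boolean isSlave() {
      return isSlave;
    }

    // In IndexFetcher.fetchLatestIndex: skip the searcher close when the
    // core is a plain replication slave, so it keeps serving queries
    // from its current searcher while files are downloaded.
    if (solrCloudMode && !isFullCopyNeeded && !replicationHandler.isSlave()) {
      solrCore.getSolrCoreState().closeIndexWriter(solrCore, true);
    }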
In the more general sense, would I be right in suggesting that in a high
indexing scenario, or indeed after some sort of network partition event, this
could pose a real problem in a normal SolrCloud setup if the new segment
files being downloaded are large and either the network isn't particularly
quick or you have a large number of followers in recovery all pulling files
from the leader? I.e. there could be a bunch of followers blocked from taking
search requests and a leader with a saturated network interface servicing all
searches and delivering segment files to followers. I'm not sure what the
right approach to solving this would be. Would opening a new searcher after
the unused files have been cleaned up be a feasible way to at least mitigate
the problem? It's probably worth noting that with the code as it stands, and
for our situation at least, it is possible for new searchers to be created
during this process (it doesn't always block), depending on the timing of
incoming search requests.
> fetchIndex blocks incoming queries when issued on a replica in SolrCloud
> ------------------------------------------------------------------------
>
> Key: SOLR-9706
> URL: https://issues.apache.org/jira/browse/SOLR-9706
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Affects Versions: 6.3, trunk
> Reporter: Erick Erickson
>
> This is something of an edge case, but it's perfectly possible to issue a
> fetchIndex command through the core admin API to a replica in SolrCloud.
> While the fetch is going on, incoming queries are blocked. Then when the
> fetch completes, all the queued-up queries execute.
> In the normal case, this is probably the proper behavior, as a fetchIndex
> during "normal" SolrCloud operation indicates that the replica's index is too
> far out of date and _shouldn't_ serve queries; this, however, is a special case.
> Why would one want to do this? Well, in _extremely_ high indexing throughput
> situations, the additional time taken for the leader forwarding the query on
> to a follower is too high. So there is an indexing cluster and a search
> cluster, and an external process that issues a fetchIndex to each replica in
> the search cluster periodically.
> What do people think about an "expert" option for fetchIndex that would cause
> a replica to behave like the old master/slave days and continue serving
> queries while the fetchindex was going on? Or another solution?
> FWIW, here are the stack traces where the blocking is going on (around 6.3).
> This is not hard to reproduce if you introduce an artificial delay into the
> fetch command and then submit a fetchIndex and try to query.
> Blocked query thread(s):
> DefaultSolrCoreState.lock(159)
> DefaultSolrCoreState.getIndexWriter(104)
> SolrCore.openNewSearcher(1781)
> SolrCore.getSearcher(1931)
> SolrCore.getSearchers(1677)
> SolrCore.getSearcher(1577)
> SolrQueryRequestBase.getSearcher(115)
> QueryComponent.process(308)
> The stack trace that releases this is:
> DefaultSolrCoreState.createMainIndexWriter(240)
> DefaultSolrCoreState.changeWriter(203)
> DefaultSolrCoreState.openIndexWriter(228) // LOCK RELEASED 2 lines later
> IndexFetcher.fetchLatestIndex(493) (approximate; I have debugging code in
> there. It's in the "finally" clause anyway.)
> IndexFetcher.fetchLatestIndex(251)