[
https://issues.apache.org/jira/browse/SOLR-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695492#comment-15695492
]
Jeremy Hoy commented on SOLR-9706:
----------------------------------
We have a setup with separate indexing and searching clusters. We do this
using both SolrCloud (primarily for searching) and master/slave replication.
It's worth noting that we ingest into named shards rather than letting
SolrCloud deal with sharding. This is achieved using the normal enable.slave
and enable.master init arguments, together with solrcloud.skip.autorecovery
and the NoOpDistributingUpdateProcessorFactory in the
updateRequestProcessorChain in solrconfig. The problem described here bit us
because the searching slaves were usually blocked when partial replication
started and stayed blocked until the new segments had been downloaded. This
was made worse for us because we monitor that Solr is responding to admin
ping requests and restart Solr if that fails or times out a number of times
in succession, which is exactly what was happening!
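For context, the relevant bits of our setup look roughly like the following
solrconfig.xml fragments. This is a sketch only: the chain name, masterUrl,
and pollInterval are illustrative, and solrcloud.skip.autorecovery is passed
as a JVM system property at startup rather than configured here.

    <!-- Keep documents in the core they were sent to: disable
         SolrCloud's distributed update processing. -->
    <updateRequestProcessorChain name="no-distrib" default="true">
      <processor class="solr.NoOpDistributingUpdateProcessorFactory"/>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

    <!-- Old-style replication, toggled per cluster with
         -Denable.master=true / -Denable.slave=true. -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="enable">${enable.master:false}</str>
        <str name="replicateAfter">commit</str>
      </lst>
      <lst name="slave">
        <str name="enable">${enable.slave:false}</str>
        <str name="masterUrl">http://indexhost:8983/solr/mycore</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>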
The reason the blocking happens is that in IndexFetcher.fetchLatestIndex the
searcher is closed if we're running SolrCloud (which in our scenario we are,
kind of!) and the replication is partial. A new index writer is then
created, which takes the index writer write lock; that in turn blocks the
creation of a new searcher, so any subsequent requests are blocked until the
write lock is released when the replication completes. This behavior was
introduced as part of SOLR-6640, so it is intentional in the sense that the
searcher is (must be) closed to prevent uncommitted/flushed files resulting
in index corruption.
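To make the sequence concrete, the partial-replication path has roughly the
following shape (a paraphrase of the 6.x code, not the actual source; the
method names match the stack traces quoted below):

    // Paraphrase of IndexFetcher.fetchLatestIndex (Solr 6.x), partial
    // replication only; solrCloudMode/isFullCopyNeeded as described above.
    if (solrCloudMode && !isFullCopyNeeded) {
      // Close the writer (with rollback) so partially downloaded,
      // uncommitted segment files cannot corrupt the index.
      solrCore.getSolrCoreState().closeIndexWriter(solrCore, true);
    }
    try {
      downloadIndexFiles(/* ... */);  // can take minutes on a slow network
    } finally {
      if (solrCloudMode && !isFullCopyNeeded) {
        // Re-creating the writer takes the write lock in
        // DefaultSolrCoreState. SolrCore.openNewSearcher() calls
        // getIndexWriter(), which waits on that same lock, so any query
        // that needs a new searcher blocks until this call completes.
        solrCore.getSolrCoreState().openIndexWriter(solrCore);
      }
    }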
For our situation, then, an obvious way to fix this is to check the init
args to see whether enable.slave is set and, if so, not close the searcher.
We could do the check using a method in IndexFetcher, by setting a private
field in the IndexFetcher constructor, or possibly by adding a public getter
for isSlave in the ReplicationHandler, or something else. I'm not sure what
the best approach is, but I'm happy to put a patch together if you have a
preference; roughly, I'm thinking of something like the sketch below.
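As a rough illustration only (the getter and the extra guard are
hypothetical, not existing API):

    // In ReplicationHandler: expose the slave flag it already tracks
    // from the "slave" init section.
    public boolean isSlave() {
      return isSlave;
    }

    // In IndexFetcher.fetchLatestIndex: skip the searcher close when the
    // core is a plain replication slave, so it keeps serving queries
    // from its current searcher while files are downloaded.
    if (solrCloudMode && !isFullCopyNeeded && !replicationHandler.isSlave()) {
      solrCore.getSolrCoreState().closeIndexWriter(solrCore, true);
    }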
In the more general sense, would I be right in suggesting that in a high
indexing scenario, or indeed after some sort of network partition event, this
could pose a real problem in a normal SolrCloud setup if the new segment
files being downloaded are large and either the network isn't particularly
quick or you have a large number of followers in recovery all pulling files
from the leader? I.e. there could be a bunch of followers blocked from taking
search requests and a leader with a saturated network interface servicing all
searches and delivering segment files to followers. I'm not sure what the
right approach to solving this would be. Would opening a new searcher after
the unused files have been cleaned up be a feasible way to at least mitigate
the problem? It's probably worth noting that with the code as it stands, and
for our situation at least, it is possible for new searchers to be created
during this process (it doesn't always block), depending on the timing of
incoming search requests.
> fetchIndex blocks incoming queries when issued on a replica in SolrCloud
> ------------------------------------------------------------------------
>
> Key: SOLR-9706
> URL: https://issues.apache.org/jira/browse/SOLR-9706
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Affects Versions: 6.3, trunk
> Reporter: Erick Erickson
>
> This is something of an edge case, but it's perfectly possible to issue a
> fetchIndex command through the core admin API to a replica in SolrCloud.
> While the fetch is going on, incoming queries are blocked. Then when the
> fetch completes, all the queued-up queries execute.
> In the normal case, this is probably the proper behavior, as a fetchIndex
> during "normal" SolrCloud operation indicates that the replica's index is too
> far out of date and _shouldn't_ serve queries; this, however, is a special case.
> Why would one want to do this? Well, in _extremely_ high indexing throughput
> situations, the additional time taken for the leader forwarding the query on
> to a follower is too high. So there is an indexing cluster and a search
> cluster, and an external process that issues a fetchIndex to each replica in
> the search cluster periodically.
> What do people think about an "expert" option for fetchIndex that would cause
> a replica to behave like the old master/slave days and continue serving
> queries while the fetchindex was going on? Or another solution?
> FWIW, here are the stack traces where the blocking is going on (around 6.3).
> This is not hard to reproduce if you introduce an artificial delay into the
> fetch command and then submit a fetchIndex and try to query.
> Blocked query thread(s):
> DefaultSolrCoreState.lock(159)
> DefaultSolrCoreState.getIndexWriter(104)
> SolrCore.openNewSearcher(1781)
> SolrCore.getSearcher(1931)
> SolrCore.getSearchers(1677)
> SolrCore.getSearcher(1577)
> SolrQueryRequestBase.getSearcher(115)
> QueryComponent.process(308)
> The stack trace that releases this is:
> DefaultSolrCoreState.createMainIndexWriter(240)
> DefaultSolrCoreState.changeWriter(203)
> DefaultSolrCoreState.openIndexWriter(228) // LOCK RELEASED 2 lines later
> IndexFetcher.fetchLatestIndex(493) (approximate; I have debugging code in
> there. It's in the "finally" clause anyway.)
> IndexFetcher.fetchLatestIndex(251)