[
https://issues.apache.org/jira/browse/SOLR-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17933035#comment-17933035
]
Jason Gerlowski commented on SOLR-17692:
----------------------------------------
Solr makes a pretty good attempt to handle this case, but either it never
targeted full-recovery, or something broke since the code was added.
[CoreContainer.unload|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L2191-L2198],
which does most of the processing for a DELETEREPLICA, makes two method calls
in an attempt to preempt any in-flight recovery or replication:
* [core.getSolrCoreState().cancelRecovery()|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/update/DefaultSolrCoreState.java#L383-L391]
- aims to cover recovery "proper", and is most relevant to the case in the
issue description above.
* [zkSys.getZkController().stopReplicationFromLeader(...)|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1503-L15]
- preempts the periodically-triggered "background" replication done by PULL
and TLOG replicas in the course of normal operation.
[DefaultSolrCoreState.cancelRecovery|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/update/DefaultSolrCoreState.java#L383-L391]
calls RecoveryStrategy.close(), which sets a boolean flag that
RecoveryStrategy checks at various points so that it can exit early. This
all works well as far as it goes. The problem, though, is that the bulk of the
logic for a full-recovery, including the loop that iterates over and fetches index
files from the leader, lives elsewhere (particularly
[ReplicationHandler.doFetch|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/ReplicationHandler.java#L452]
and
[IndexFetcher.fetchLatestIndex|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L405])
and doesn't have access to RecoveryStrategy or its early-exit flag. So in
practice, once a full-recovery starts fetching files from the leader, it won't
check RecoveryStrategy's early-exit flag again until the fetch has finished.
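
To make the gap concrete, here is a minimal, self-contained sketch. It is not Solr code: the Recovery class, fetchAll, and the afterEachFile hook are hypothetical stand-ins, and the hook exists only so the demo can simulate a cancel arriving mid-fetch without threads. The outer object has a close flag, but the file loop never looks at it, so a cancel that arrives after the first file changes nothing.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class OuterFlagSketch {
  /** Hypothetical stand-in for the file-fetch loop; it knows nothing about the outer flag. */
  static int fetchAll(List<String> files, Runnable afterEachFile) {
    int fetched = 0;
    for (String file : files) {
      fetched++;            // pretend to download `file`
      afterEachFile.run();  // demo hook: simulates a cancel arriving mid-fetch
    }
    return fetched;
  }

  /** Hypothetical stand-in for the recovery driver, which owns the close flag. */
  static final class Recovery {
    final AtomicBoolean closed = new AtomicBoolean(false);

    void close() {
      closed.set(true);
    }

    /** Returns the number of files fetched. */
    int run(List<String> files, Runnable afterEachFile) {
      if (closed.get()) {
        return 0; // the flag is honoured before the fetch starts
      }
      // Nothing inside fetchAll reads `closed`, so a close() during the
      // fetch cannot take effect until this call returns.
      return fetchAll(files, afterEachFile);
    }
  }

  public static void main(String[] args) {
    Recovery recovery = new Recovery();
    // The cancel fires after every file, starting with the first one.
    int fetched = recovery.run(List.of("a", "b", "c", "d"), recovery::close);
    System.out.println("fetched=" + fetched + " closed=" + recovery.closed.get());
  }
}
```

In this toy version all four files are still fetched even though close() was called after the first one, which mirrors the behaviour described above.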
Interestingly, IndexFetcher has [its own early-exit
flag|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L1569-L1571]
that can be set by a call to IndexFetcher.destroy(), and it is already used
for preemption in the PULL/TLOG replica "background replication" scenario.
This flag [_does_ get
checked|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L1747-L1751]
in the "iterate and fetch each leader index file" loop, so it seems like a
really great option for our case, if we can find a way to set it from
[DefaultSolrCoreState.cancelRecovery|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/update/DefaultSolrCoreState.java#L383-L391].
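
One possible shape for that wiring, as a rough sketch only: the cancel path keeps a reference to whichever fetcher is currently in flight and forwards the cancel to it. Everything here is hypothetical (CoreState, Fetcher, the activeFetcher field, and the afterEachFile demo hook are illustrative names, not Solr's classes or fields), and a real change would have to respect Solr's existing locking and lifecycle.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class ForwardedCancelSketch {
  /** Hypothetical fetcher with its own abort flag, checked once per file. */
  static final class Fetcher {
    private volatile boolean aborted = false;

    void destroy() {
      aborted = true;
    }

    int fetchAll(List<String> files, Runnable afterEachFile) {
      int fetched = 0;
      for (String file : files) {
        if (aborted) {
          break;            // early exit inside the loop
        }
        fetched++;          // pretend to download `file`
        afterEachFile.run(); // demo hook: simulates a cancel arriving mid-fetch
      }
      return fetched;
    }
  }

  /** Hypothetical core state that remembers the in-flight fetcher. */
  static final class CoreState {
    private final AtomicReference<Fetcher> activeFetcher = new AtomicReference<>();

    int recover(List<String> files, Runnable afterEachFile) {
      Fetcher fetcher = new Fetcher();
      activeFetcher.set(fetcher);
      try {
        return fetcher.fetchAll(files, afterEachFile);
      } finally {
        // Clear the reference only if it still points at this fetcher.
        activeFetcher.compareAndSet(fetcher, null);
      }
    }

    void cancelRecovery() {
      Fetcher fetcher = activeFetcher.get();
      if (fetcher != null) {
        fetcher.destroy(); // forward the cancel to the file loop
      }
    }
  }

  public static void main(String[] args) {
    CoreState state = new CoreState();
    int fetched = state.recover(List.of("a", "b", "c", "d"), state::cancelRecovery);
    System.out.println("fetched=" + fetched);
  }
}
```

Here the cancel arrives after the first file and the loop stops before the second, so only one of the four files is fetched.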
> DELETEREPLICA should preempt full-recovery instead of waiting for completion
> ----------------------------------------------------------------------------
>
> Key: SOLR-17692
> URL: https://issues.apache.org/jira/browse/SOLR-17692
> Project: Solr
> Issue Type: Bug
> Components: replication (java), SolrCloud
> Reporter: Jason Gerlowski
> Priority: Major
>
> I recently deleted an NRT replica that was in the middle of a full-recovery
> and was a bit surprised to see that the "delete" blocked waiting for the
> recovery to finish. This is a minor pain when the index is small, but
> becomes a huge waste of administrator time (and network bandwidth!) as index
> sizes grow.
> There's some plumbing in Solr that attempts to preempt recovery during a
> DELETE, but it appears that it mostly comes into play during
> peer-sync and "background replication" scenarios (i.e. PULL and TLOG replicas
> that do full-recovery during normal operation). Preemption doesn't seem to
> work once a recovering core is in the midst of a "full recovery". We should
> modify this code so that it stops full-recovery as well, unless there's some
> compelling reason this was avoided in the initial implementation?