[ 
https://issues.apache.org/jira/browse/SOLR-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17933035#comment-17933035
 ] 

Jason Gerlowski commented on SOLR-17692:
----------------------------------------

Solr makes a pretty good attempt to handle this case, but either it never 
targeted full-recovery, or something broke since the code was added.  
[CoreContainer.unload|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L2191-L2198],
 which does most of the processing for a DELETEREPLICA, makes two method calls 
in an attempt to preempt any in-progress recovery:

* 
[core.getSolrCoreState().cancelRecovery()|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/update/DefaultSolrCoreState.java#L383-L391]
 - aims to cover recovery "proper", and is most relevant to the case in the 
issue description above.
* 
[zkSys.getZkController().stopReplicationFromLeader(...)|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1503-L15]
 - preempts the periodically-triggered "background" replication done by PULL 
and TLOG replicas in the course of normal operation

[DefaultSolrCoreState.cancelRecovery|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/update/DefaultSolrCoreState.java#L383-L391]
 calls RecoveryStrategy.close(), which sets a boolean flag that 
RecoveryStrategy checks at various points so that it can exit early.  This 
all works great so far as it goes.  The problem though is that the bulk of the 
logic for a full-recovery, including the loop to iterate over and fetch index 
files from the leader, lives elsewhere (particularly 
[ReplicationHandler.doFetch|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/ReplicationHandler.java#L452]
 and 
[IndexFetcher.fetchLatestIndex|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L405])
 and doesn't have access to RecoveryStrategy and its early-exit flag.  So in 
practice once a full-recovery starts fetching files from the leader, it won't 
check RecoveryStrategy's early-exit flag again until it's finished.
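
To make the gap concrete, here is a minimal Java sketch of the pattern. It is not Solr's code: the classes Recovery and Fetcher and the afterEachFile hook are hypothetical stand-ins I made up for illustration, and the real RecoveryStrategy and IndexFetcher do far more. The only point is that a flag owned by the outer object is never consulted by a loop that has no reference to it.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

final class UnreachableFlagSketch {
    /** Hypothetical stand-in for the file-fetching loop; it has no reference to the outer flag. */
    static final class Fetcher {
        int fetched = 0;

        void fetchAll(List<String> files, Runnable afterEachFile) {
            for (String file : files) {
                fetched++;            // pretend to download this file from the leader
                afterEachFile.run();  // demo hook: lets the caller inject a cancel mid-fetch
            }
        }
    }

    /** Hypothetical stand-in for the recovery driver that owns the early-exit flag. */
    static final class Recovery {
        final AtomicBoolean closed = new AtomicBoolean(false);
        final Fetcher fetcher = new Fetcher();

        void close() {
            closed.set(true);
        }

        boolean run(List<String> files, Runnable afterEachFile) {
            if (closed.get()) return false;          // flag is checked before the fetch
            fetcher.fetchAll(files, afterEachFile);  // but nothing checks it during the fetch
            return !closed.get();                    // and it is only seen again afterwards
        }
    }

    public static void main(String[] args) {
        Recovery recovery = new Recovery();
        // Request cancellation right after the first file; the remaining files are fetched anyway.
        boolean completed = recovery.run(List.of("a", "b", "c"), recovery::close);
        System.out.println("files fetched: " + recovery.fetcher.fetched + ", completed: " + completed);
    }
}
```

In this toy version all three files are fetched even though close() was requested after the first one, which mirrors the "delete blocks until recovery finishes" behavior.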

Interestingly, IndexFetcher has [its own early-exit 
flag|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L1569-L1571]
 that can be set by a call to IndexFetcher.destroy(), and it is already used 
for preemption in the PULL/TLOG replica "background replication" scenario.  
This flag [_does_ get 
checked|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L1747-L1751]
 in the "iterate and fetch each leader index file" loop, so it seems like a 
really great option for our case, if we can find a way to set it from 
[DefaultSolrCoreState.cancelRecovery|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/update/DefaultSolrCoreState.java#L383-L391].
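
A rough Java sketch of what "setting it from cancelRecovery" could look like follows. Everything in it is an assumption for illustration: CoreState, Fetcher, beginFetch and the afterEachFile hook are hypothetical names, not Solr's actual classes or signatures, and how DefaultSolrCoreState would really obtain a reference to the active IndexFetcher is exactly the open question. The sketch only shows the shape of the idea: the cancel call forwards to the fetcher's own stop flag, which the per-file loop already checks.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

final class ForwardedCancelSketch {
    /** Hypothetical stand-in for a fetcher that owns its own early-exit flag. */
    static final class Fetcher {
        private final AtomicBoolean stop = new AtomicBoolean(false);
        int fetched = 0;

        /** Loosely modeled on the role of a destroy()-style call: just raises the stop flag. */
        void destroy() {
            stop.set(true);
        }

        boolean fetchAll(List<String> files, Runnable afterEachFile) {
            for (String file : files) {
                if (stop.get()) return false;  // checked once per file, inside the loop
                fetched++;                     // pretend to download this file from the leader
                afterEachFile.run();           // demo hook: lets the caller inject a cancel mid-fetch
            }
            return true;
        }
    }

    /** Hypothetical stand-in for the core state that receives the cancel request. */
    static final class CoreState {
        private volatile Fetcher activeFetcher;

        Fetcher beginFetch() {
            activeFetcher = new Fetcher();
            return activeFetcher;
        }

        void cancelRecovery() {
            Fetcher f = activeFetcher;  // read once so the null check and the call see the same object
            if (f != null) {
                f.destroy();            // forward the cancel to the flag the fetch loop actually reads
            }
        }
    }

    public static void main(String[] args) {
        CoreState state = new CoreState();
        Fetcher fetcher = state.beginFetch();
        // Request cancellation right after the first file; the loop stops before the second.
        boolean completed = fetcher.fetchAll(List.of("a", "b", "c"), state::cancelRecovery);
        System.out.println("files fetched: " + fetcher.fetched + ", completed: " + completed);
    }
}
```

Here the fetch stops after one file instead of three, because the cancel request reaches a flag that the loop can see.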

> DELETEREPLICA should preempt full-recovery instead of waiting for completion
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-17692
>                 URL: https://issues.apache.org/jira/browse/SOLR-17692
>             Project: Solr
>          Issue Type: Bug
>          Components: replication (java), SolrCloud
>            Reporter: Jason Gerlowski
>            Priority: Major
>
> I recently deleted an NRT replica that was in the middle of a full-recovery 
> and was a bit surprised to see that the "delete" blocked waiting for the 
> recovery to finish.  This is a minor pain when the index is small, but 
> becomes a huge waste of administrator time (and network bandwidth!) as index 
> sizes grow.
> There's some plumbing in Solr that attempts to preempt recovery during a 
> DELETE, but it appears that it mostly comes into play during 
> peer-sync and "background replication" scenarios (i.e. PULL and TLOG replicas 
> that do full-recovery during normal operation).  Preemption doesn't seem to 
> work once a recovering core is in the midst of a "full recovery".  We should 
> modify this code so that it stops full-recovery as well, unless there's some 
> compelling reason this was avoided in the initial implementation?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
