[ 
https://issues.apache.org/jira/browse/SOLR-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975807#comment-15975807
 ] 

Mike Drob commented on SOLR-10525:
----------------------------------

It occurs to me that with this change if there is a steady stream of recovery 
requests then each one will start up and get cancelled. I don't have a good 
grasp on how heavy the start cost here is, it looks like mostly checking the 
update log in RecoveryStrategy.

The previous behaviour would be to run a full recovery and then run another 
full recovery. If failures were still happening at that point, then we could 
queue up another recovery. So we're not impacting availability when the root 
cause hasn't been fixed.

It does get better if the root cause is more temporal, since the final recovery 
can start as soon as the problem is fixed, instead of waiting for another 
recovery to finish.

I think I've convinced myself to not worry about the churn, but would like a 
second opinion.

> Stacked recovery requests can interfere with one another
> --------------------------------------------------------
>
>                 Key: SOLR-10525
>                 URL: https://issues.apache.org/jira/browse/SOLR-10525
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Mike Drob
>         Attachments: SOLR-10525.patch
>
>
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/DefaultSolrCoreState.java#L300-L310
> Two issues with this code:
> {code}
>           boolean locked = recoveryLock.tryLock();
>           try {
>             if (!locked) {
>               if (recoveryWaiting.get() > 0) { // line 1
>                 return;
>               }
>               recoveryWaiting.incrementAndGet(); // line 2
>             } else {
>               recoveryWaiting.incrementAndGet();
>               cancelRecovery(); // line 3
> }
> {code}
> The {{cancelRecovery}} on line 3 call will only hit when there are no 
> recoveries to actually cancel (since we got the lock that means there are no 
> recoveries in progress). Instead it should be moved either to the either 
> branch of the if, or outside after the if since we know we will be running a 
> recovery at that point.
> This code doesn't always prevent multiple requests from stacking. If there is 
> a recovery running, but no recoveries currently waiting, multiple requests 
> can check the count at line 1 before any of them will increment the count at 
> line 2 and thus all of them will hit the increment.
> I don't have specific tests for this, but it's causing failures for me on my 
> SOLR-9555 work in progress.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to