[
https://issues.apache.org/jira/browse/SOLR-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15980716#comment-15980716
]
Mark Miller commented on SOLR-10525:
------------------------------------
bq. I think I've convinced myself to not worry about the churn, but would like
a second opinion.
No, I do think it's a problem at a high level. I'm actually hoping we get the
other issue so that it won't hit ZK for every failed update. That is part of
the problem now and why I tried to make this so concurrent. Right now, tons of
update failures coming in end up generating tons of recovery HTTP requests
from the leader to the replica, and I guess with LIR, a ZK contact. Rather than
just hitting ZK the same way in SOLR-9555 though, it would be nice to have
something that would hit ZK at most n times per second or something. An update
failure means: at some point later than now, but very soon, trigger a recovery.
We shouldn't have to do this per update, and in fact we really don't want to
anymore.
With that issue fixed, ensuring that no recoveries stack up no longer really
has to worry about such high concurrency.
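The "at most n times per second" idea above could look roughly like the following. This is only a sketch under assumptions: {{RecoveryThrottle}} and {{tryTrigger}} are hypothetical names, not existing Solr classes or methods, and the clock is injected purely so the behaviour can be checked without sleeping.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical helper, not part of Solr: lets at most one caller per interval proceed. */
class RecoveryThrottle {
  private final long minIntervalNanos;
  private final LongSupplier clock;      // injectable time source, e.g. System::nanoTime
  private final AtomicLong nextAllowed;  // earliest time at which the next trigger may fire

  RecoveryThrottle(long minIntervalNanos, LongSupplier clock) {
    this.minIntervalNanos = minIntervalNanos;
    this.clock = clock;
    this.nextAllowed = new AtomicLong(clock.getAsLong());
  }

  /** Returns true for at most one caller per interval; every other caller is coalesced. */
  boolean tryTrigger() {
    long now = clock.getAsLong();
    long allowed = nextAllowed.get();
    if (now - allowed < 0) {
      return false; // still inside the quiet period, skip the ZK contact
    }
    // Only one thread can win this CAS, so concurrent failures collapse into one trigger.
    return nextAllowed.compareAndSet(allowed, now + minIntervalNanos);
  }
}
```

With n = 10, a caller might construct it as {{new RecoveryThrottle(TimeUnit.MILLISECONDS.toNanos(100), System::nanoTime)}} and only contact ZK when {{tryTrigger()}} returns true. Note that this sketch simply drops the coalesced failures; a fuller design would probably also schedule one deferred trigger so that a failure arriving during the quiet period is still acted on.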
> Stacked recovery requests can interfere with one another
> --------------------------------------------------------
>
> Key: SOLR-10525
> URL: https://issues.apache.org/jira/browse/SOLR-10525
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Reporter: Mike Drob
> Attachments: SOLR-10525.patch, SOLR-10525.patch
>
>
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/DefaultSolrCoreState.java#L300-L310
> Two issues with this code:
> {code}
> boolean locked = recoveryLock.tryLock();
> try {
>   if (!locked) {
>     if (recoveryWaiting.get() > 0) { // line 1
>       return;
>     }
>     recoveryWaiting.incrementAndGet(); // line 2
>   } else {
>     recoveryWaiting.incrementAndGet();
>     cancelRecovery(); // line 3
>   }
> {code}
> The {{cancelRecovery}} call on line 3 will only hit when there are no
> recoveries to actually cancel (since we got the lock, there are no
> recoveries in progress). Instead it should be moved either to the other
> branch of the if, or outside after the if, since we know we will be running a
> recovery at that point.
> This code doesn't always prevent multiple requests from stacking. If there is
> a recovery running, but no recoveries currently waiting, multiple requests
> can check the count at line 1 before any of them will increment the count at
> line 2 and thus all of them will hit the increment.
> I don't have specific tests for this, but it's causing failures for me on my
> SOLR-9555 work in progress.
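The check-then-increment race described in the issue (line 1 versus line 2) goes away if the check and the increment happen as one atomic step. A minimal sketch, assuming a hypothetical {{WaitingSlot}} class; this is an illustration of the technique, not the attached patch and not the actual {{DefaultSolrCoreState}} code:

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration: at most one request may be queued behind a running recovery. */
class WaitingSlot {
  private final AtomicInteger recoveryWaiting = new AtomicInteger();

  /**
   * Combines the "is anyone waiting?" check and the increment into a single
   * compare-and-set, so only one caller can move the count from 0 to 1.
   */
  boolean tryClaim() {
    return recoveryWaiting.compareAndSet(0, 1);
  }

  /** Called by the winner once it stops waiting (for example, after taking the lock). */
  void release() {
    recoveryWaiting.decrementAndGet();
  }

  int waiting() {
    return recoveryWaiting.get();
  }
}
```

A request that fails {{tryClaim()}} would simply return, which matches the intent of the early {{return}} in the quoted code, while the single winner goes on to wait for the recovery lock.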