Pierre Salagnac created SOLR-17160:
--------------------------------------

             Summary: Bulk admin operations may fail because of max tracked 
requests
                 Key: SOLR-17160
                 URL: https://issues.apache.org/jira/browse/SOLR-17160
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Backup/Restore
    Affects Versions: 8.11, 9.5
            Reporter: Pierre Salagnac


In {{CoreAdminHandler}}, we maintain an in-memory list of in-flight 
requests and of completed/failed requests.

_Note that these are core/replica-level async requests, not top-level requests, 
which are mostly at the collection level. Top-level requests are tracked by storing 
the async ID in a ZooKeeper node, which is not related to this ticket._

 

For completed/failed requests, we track at most 100 requests and drop the 
oldest ones beyond that limit. The typical client in 
{{CollectionHandlingUtils.waitForCoreAdminAsyncCallToComplete()}} polls the status 
of the submitted requests in a retry loop until the requests are completed. If 
for some reason more than 100 requests complete or fail on a node 
before the client has polled all statuses, the oldest statuses are lost and the 
client fails with an unexpected error similar to:

{{Invalid status request for requestId: '<id>' - 'notfound'. Retried 
<n> times}}
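
To illustrate the failure mode, here is a minimal sketch of a size-capped status map. It is not the actual {{CoreAdminHandler}} code: the class {{BoundedStatusTracker}} and its methods are hypothetical names, and the only behavior taken from the description above is the fixed cap with oldest-first eviction.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not Solr's CoreAdminHandler.
class BoundedStatusTracker {
    private final Map<String, String> finished;

    BoundedStatusTracker(final int maxEntries) {
        // Insertion-ordered map that drops its eldest entry once the cap is exceeded.
        this.finished = new LinkedHashMap<String, String>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Record the final status ("completed" or "failed") of a request.
    synchronized void markFinished(String requestId, String status) {
        finished.put(requestId, status);
    }

    // A status that was evicted is indistinguishable from one that never existed.
    synchronized String status(String requestId) {
        return finished.getOrDefault(requestId, "notfound");
    }
}
```

With a cap of 100, recording a 101st finished request evicts the first one, so a client that has not yet polled the first request ID gets 'notfound' even though the request succeeded.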

 

Instead of a hard limit on the number of requests we track, we could 
use time-based eviction. I think it makes sense to keep the status of a request 
until a given timeout expires, and then drop it regardless of how many requests we 
currently track.
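
A rough sketch of what time-based eviction could look like, under stated assumptions: {{ExpiringStatusTracker}}, its method names, the injected clock, and the timeout value are all hypothetical and chosen for illustration. This is one possible shape, not a proposed patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical illustration only; names and structure are assumptions.
class ExpiringStatusTracker {
    private static final class Entry {
        final String status;
        final long finishedAtNanos;

        Entry(String status, long finishedAtNanos) {
            this.status = status;
            this.finishedAtNanos = finishedAtNanos;
        }
    }

    private final Map<String, Entry> finished = new ConcurrentHashMap<>();
    private final long ttlNanos;
    private final LongSupplier clock; // e.g. System::nanoTime; injectable for tests

    ExpiringStatusTracker(long ttlNanos, LongSupplier clock) {
        this.ttlNanos = ttlNanos;
        this.clock = clock;
    }

    // Record the final status; no entry is dropped because of map size.
    void markFinished(String requestId, String status) {
        evictExpired();
        finished.put(requestId, new Entry(status, clock.getAsLong()));
    }

    String status(String requestId) {
        evictExpired();
        Entry e = finished.get(requestId);
        return e == null ? "notfound" : e.status;
    }

    // Drop entries older than the timeout, regardless of how many are tracked.
    private void evictExpired() {
        long now = clock.getAsLong();
        finished.values().removeIf(e -> now - e.finishedAtNanos >= ttlNanos);
    }
}
```

The trade-off is that memory use is bounded by request rate times the timeout rather than by a fixed count, so the timeout has to be long enough for clients to finish polling but short enough to keep the map small.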



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
