[jira] [Commented] (SOLR-17160) Bulk admin operations may fail because of max tracked requests

David Smiley (Jira) Fri, 01 Mar 2024 18:23:47 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-17160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822738#comment-17822738
 ]


David Smiley commented on SOLR-17160:
-------------------------------------

Even at the top level (ZK), the async IDs are capped at 10K before getting 
purged -- SizeLimitedDistributedMap with 
org.apache.solr.cloud.Overseer#NUM_RESPONSES_TO_STORE (10K).

100 is clearly way too low.

> Bulk admin operations may fail because of max tracked requests
> --------------------------------------------------------------
>
>                 Key: SOLR-17160
>                 URL: https://issues.apache.org/jira/browse/SOLR-17160
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Backup/Restore
>    Affects Versions: 8.11, 9.5
>            Reporter: Pierre Salagnac
>            Priority: Minor
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In {{CoreAdminHandler}}, we maintain an in-memory list of in-flight 
> requests and completed/failed requests.
> _Note that these are core/replica-level async requests, not top-level 
> requests, which are mostly at the collection level. Top-level requests are 
> tracked by storing the async ID in a ZooKeeper node, which is unrelated to 
> this ticket._
>  
> For completed/failed requests, we track at most 100 requests, dropping the 
> oldest ones. The typical client, 
> {{CollectionHandlingUtils.waitForCoreAdminAsyncCallToComplete()}}, polls 
> the status of the submitted requests in a retry loop until they have 
> completed. If for some reason more than 100 requests complete or fail on a 
> node before the client has polled all statuses, the oldest statuses are 
> lost and the client fails with an unexpected error similar to:
> {{Invalid status request for requestId: '_<id>_' - 'notfound'. Retried 
> _<n>_ times}}
>  
> Instead of a hard limit on the number of requests we track, we could use 
> time-based eviction. I think it makes sense to keep the status of a request 
> until a given timeout expires, and then drop it regardless of how many 
> requests we currently track.
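
To illustrate the time-based eviction proposed above, here is a minimal Java sketch. It is not Solr's actual code or the fix for this ticket: the class name TimedStatusTracker, its methods, and the injected clock are all hypothetical and chosen only for illustration. It assumes the clock never goes backwards, so entries in insertion order are also in age order.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical sketch: keeps request statuses until they expire, with no cap on count. */
final class TimedStatusTracker {

    /** A stored status together with the time it was recorded. */
    private static final class Entry {
        final String status;
        final long storedAt;

        Entry(String status, long storedAt) {
            this.status = status;
            this.storedAt = storedAt;
        }
    }

    private final long ttl;          // how long a status is kept, in clock units
    private final LongSupplier clock; // assumed monotonic, e.g. System::nanoTime

    // Insertion-ordered, so iteration visits the oldest entries first.
    private final Map<String, Entry> completed = new LinkedHashMap<>();

    TimedStatusTracker(long ttl, LongSupplier clock) {
        this.ttl = ttl;
        this.clock = clock;
    }

    /** Records the final status of a request. */
    synchronized void recordCompleted(String requestId, String status) {
        evictExpired();
        // Remove first so that a re-recorded ID moves to the end of the order.
        completed.remove(requestId);
        completed.put(requestId, new Entry(status, clock.getAsLong()));
    }

    /** Returns the stored status, or "notfound" if it is unknown or expired. */
    synchronized String getStatus(String requestId) {
        evictExpired();
        Entry entry = completed.get(requestId);
        return entry == null ? "notfound" : entry.status;
    }

    /** Returns the number of statuses that have not expired yet. */
    synchronized int size() {
        evictExpired();
        return completed.size();
    }

    /** Drops entries whose age has reached the TTL, oldest first. */
    private void evictExpired() {
        long now = clock.getAsLong();
        Iterator<Entry> it = completed.values().iterator();
        while (it.hasNext()) {
            if (now - it.next().storedAt < ttl) {
                break; // everything after this entry is newer
            }
            it.remove();
        }
    }
}
```

With this shape, a burst of several hundred completions on one node stays visible to a polling client for the whole timeout, and memory is bounded by the completion rate multiplied by the timeout rather than by a fixed count.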



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
