[jira] [Commented] (SOLR-17160) Bulk admin operations may fail because of max tracked requests

ASF subversion and git services (Jira) Tue, 16 Jul 2024 13:29:17 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-17160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866531#comment-17866531
 ]


ASF subversion and git services commented on SOLR-17160:
--------------------------------------------------------

Commit 338fbdacbecaba76741fe5f8b755e2374f88d8a0 in solr's branch 
refs/heads/backport_SOLR-16842_to_branch_9x from Pierre Salagnac
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=338fbdacbec ]

SOLR-17160: Core admin async ID status, 10k limit and time expire (#2304)

Core Admin "async" request status tracking is no longer capped at 100; it's 10k.
Statuses are now removed 5 minutes after the read of a completed/failed status.
Helps collection async backup/restore and other operations scale to 100+ shards.

Co-authored-by: David Smiley <[email protected]>
(cherry picked from commit d3b4c2e1ae39b8ecc5428798531f8b7cf723d787)
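
The behavior described in the commit message (a 10k cap, plus removal 5 minutes after a completed/failed status is read) can be sketched with plain JDK collections. This is only an illustration, not the code from the commit: the class name `AsyncStatusTracker`, its methods, the injected clock, and the choice to start the timer at the first read rather than reset it on every read are all assumptions.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a capped status tracker with expire-after-read. */
final class AsyncStatusTracker {
  private static final class Entry {
    final String status;
    // No deadline until a completed/failed status has been read once.
    long expiresAtMillis = Long.MAX_VALUE;

    Entry(String status) {
      this.status = status;
    }
  }

  private final int maxEntries;
  private final long expireAfterReadMillis;
  private final LongSupplier clockMillis; // injected so tests can control time
  private final LinkedHashMap<String, Entry> entries = new LinkedHashMap<>();

  AsyncStatusTracker(int maxEntries, long expireAfterReadMillis, LongSupplier clockMillis) {
    this.maxEntries = maxEntries;
    this.expireAfterReadMillis = expireAfterReadMillis;
    this.clockMillis = clockMillis;
  }

  /** Records or updates the status of a request and enforces the size cap. */
  synchronized void put(String requestId, String status) {
    evictExpired();
    entries.remove(requestId); // re-insert so the entry becomes the newest
    entries.put(requestId, new Entry(status));
    Iterator<String> oldestFirst = entries.keySet().iterator();
    while (entries.size() > maxEntries && oldestFirst.hasNext()) {
      oldestFirst.next();
      oldestFirst.remove(); // drop the oldest entries beyond the cap
    }
  }

  /** Returns the tracked status, or "notfound" if it was never stored or was evicted. */
  synchronized String get(String requestId) {
    evictExpired();
    Entry entry = entries.get(requestId);
    if (entry == null) {
      return "notfound";
    }
    boolean finished = entry.status.equals("completed") || entry.status.equals("failed");
    if (finished && entry.expiresAtMillis == Long.MAX_VALUE) {
      // Assumption: the expiry timer starts at the first read of a finished status.
      entry.expiresAtMillis = clockMillis.getAsLong() + expireAfterReadMillis;
    }
    return entry.status;
  }

  private void evictExpired() {
    long now = clockMillis.getAsLong();
    entries.values().removeIf(e -> e.expiresAtMillis <= now);
  }
}
```

With `new AsyncStatusTracker(10_000, 5 * 60 * 1000L, System::currentTimeMillis)`, a finished status that nobody has polled yet stays available until the cap pushes it out, and it is dropped 5 minutes after a client first reads it.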


> Bulk admin operations may fail because of max tracked requests
> --------------------------------------------------------------
>
>                 Key: SOLR-17160
>                 URL: https://issues.apache.org/jira/browse/SOLR-17160
>             Project: Solr
>          Issue Type: Bug
>          Components: Backup/Restore
>    Affects Versions: 8.11, 9.5
>            Reporter: Pierre Salagnac
>            Priority: Minor
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> In {{CoreAdminHandler}}, we maintain an in-memory list of in-flight 
> requests and completed/failed requests.
> _Note: these are core/replica-level async requests, not top-level requests, 
> which are mostly at the collection level. Top-level requests are tracked by 
> storing the async ID in a ZooKeeper node, which is not related to this 
> ticket._
>  
> For completed/failed requests, we only track a maximum of 100 requests by 
> dropping the oldest ones. The typical client in 
> {{CollectionHandlingUtils.waitForCoreAdminAsyncCallToComplete()}} polls 
> status of the submitted requests, with a retry loop until requests are 
> completed. If for some reason we have more than 100 requests that complete or 
> fail on a node before all statuses are polled by the client, the statuses are 
> lost and the client will fail with an unexpected error similar to:
> {{Invalid status request for requestId: '<id>' - 'notfound'. Retried 
> <n> times}}
>  
> Instead of having a hard limit on the number of requests we track, we could 
> have time-based eviction. I think it makes sense to keep the status of a 
> request until a given timeout, and then drop it regardless of how many 
> requests we currently track.
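
The failure mode described in the issue can be reproduced in isolation with a map that keeps only the most recent N finished requests. The sketch below is illustrative only and is not taken from `CoreAdminHandler`; the names `CappedStatusMap` and `CappedStatusDemo` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical map that drops the oldest entry once it exceeds a fixed cap. */
final class CappedStatusMap extends LinkedHashMap<String, String> {
  private static final long serialVersionUID = 1L;
  private final int maxEntries;

  CappedStatusMap(int maxEntries) {
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
    // Called by LinkedHashMap after each insertion; true evicts the oldest entry.
    return size() > maxEntries;
  }

  /** Mirrors the 'notfound' answer a polling client sees for an unknown ID. */
  String statusOf(String requestId) {
    return getOrDefault(requestId, "notfound");
  }
}

final class CappedStatusDemo {
  /** Completes {@code completed} requests before any poll, then counts lost statuses. */
  static int countLost(int cap, int completed) {
    CappedStatusMap statuses = new CappedStatusMap(cap);
    for (int i = 0; i < completed; i++) {
      statuses.put("req-" + i, "completed");
    }
    int lost = 0;
    for (int i = 0; i < completed; i++) {
      if (statuses.statusOf("req-" + i).equals("notfound")) {
        lost++;
      }
    }
    return lost;
  }
}
```

With a cap of 100 and 150 requests finishing before the client polls, `CappedStatusDemo.countLost(100, 150)` reports the 50 oldest statuses as lost, which is the situation that makes the polling loop give up with the 'notfound' error.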



