[jira] [Commented] (SOLR-12729) SplitShardCmd should lock the parent shard to prevent parallel splitting requests

Jason Gerlowski (Jira) Wed, 03 Sep 2025 07:11:39 -0700


    [ https://issues.apache.org/jira/browse/SOLR-12729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18017912#comment-18017912 ]


Jason Gerlowski commented on SOLR-12729:
----------------------------------------

Is my understanding above correct [~ab]?  Am I missing anything?

If so, what do you think about changing the locking to cover only the split 
itself (i.e. NOT sub-shard replica recovery)?  Replicas do full recovery all 
the time without the benefit of any special locking.  If we're worried about 
some automation re-triggering the split while replicas are in recovery, then 
we can detect that and abort in SplitShardCmd.  [SplitShardCmd already has a 
check for "active" 
sub-shards|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L331-L334]; 
it seems reasonable to extend that to cover "recovering" sub-shards as well.
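
To make the suggestion concrete, here is a minimal sketch of what an extended 
check could look like.  It is not the SplitShardCmd code: the class, enum, and 
method names are hypothetical stand-ins, and a plain map takes the place of 
Solr's cluster state.  It only illustrates the idea of aborting when a 
sub-shard is either active or still recovering.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these types are Solr classes.
final class SplitGuardSketch {

    // Simplified stand-in for shard states, limited to the states named in this thread.
    enum ShardState { ACTIVE, INACTIVE, RECOVERY }

    /**
     * Decides whether a new split request should be rejected, given the
     * sub-shards that already exist for the parent shard.
     *
     * @param existingSubShards sub-shard name mapped to its current state
     * @return a reason to abort, or empty if the split may proceed
     */
    static Optional<String> abortReason(Map<String, ShardState> existingSubShards) {
        for (Map.Entry<String, ShardState> entry : existingSubShards.entrySet()) {
            ShardState state = entry.getValue();
            if (state == ShardState.ACTIVE) {
                // Corresponds to the kind of check that exists today.
                return Optional.of("sub-shard " + entry.getKey() + " is already active");
            }
            if (state == ShardState.RECOVERY) {
                // The proposed extension: an earlier split is still finishing.
                return Optional.of("sub-shard " + entry.getKey()
                        + " is still recovering from an earlier split");
            }
        }
        // No sub-shards, or only stale inactive ones: nothing blocks the split here.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(abortReason(Map.of()));
        System.out.println(abortReason(Map.of("shard1_0", ShardState.RECOVERY)));
        System.out.println(abortReason(Map.of("shard1_0", ShardState.ACTIVE)));
    }
}
```

With a check along these lines, a repeated SPLITSHARD issued while the 
sub-shards are recovering would be rejected instead of deleting them, so the 
lock would not need to be held through replica recovery.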

> SplitShardCmd should lock the parent shard to prevent parallel splitting 
> requests
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-12729
>                 URL: https://issues.apache.org/jira/browse/SOLR-12729
>             Project: Solr
>          Issue Type: Bug
>          Components: AutoScaling
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>             Fix For: 7.6, 8.0
>
>
> This scenario was discovered by the simulation framework, but it also exists 
> in the non-simulated code.
> When {{IndexSizeTrigger}} requests SPLITSHARD, which is then successfully 
> started and “completed” from the point of view of {{ExecutePlanAction}}, the 
> reality is that it can still take a significant amount of time until the 
> new replicas fully recover and cause the switch of shard states 
> (parent to INACTIVE, child from RECOVERY to ACTIVE).
> If this time is longer than the trigger's {{waitFor}}, the trigger will issue 
> the same SPLITSHARD request again. {{SplitShardCmd}} doesn't prevent this new 
> request from being processed because the parent shard is still ACTIVE. 
> However, a section of the code in {{SplitShardCmd}} will realize that 
> sub-slices with the target names already exist and they are not active, at 
> which point it will delete the new sub-slices ({{SplitShardCmd:182}}).
> The end result is an infinite loop, where {{IndexSizeTrigger}} will keep 
> generating SPLITSHARD, and {{SplitShardCmd}} will keep deleting the 
> recovering sub-slices created by the previous command.
> A simple solution is for the parent shard to be marked to indicate that it’s 
> in the process of splitting, so that no other split is attempted on the same 
> shard. Furthermore, {{IndexSizeTrigger}} could temporarily exclude such 
> shards from monitoring.



