[ https://issues.apache.org/jira/browse/SOLR-13945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ishan Chattopadhyaya resolved SOLR-13945.
-----------------------------------------
    Fix Version/s: 8.4
       Resolution: Fixed

> SPLITSHARD data loss due to "rollback"
> --------------------------------------
>
>                 Key: SOLR-13945
>                 URL: https://issues.apache.org/jira/browse/SOLR-13945
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Ishan Chattopadhyaya
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>             Fix For: 8.4
>
>         Attachments: SOLR-13945.patch, SOLR-13945.patch, SOLR-13945.patch
>
>
> # As per SOLR-7673, there is a commit on the parent shard *after the state
> changes* have happened, i.e. after the parent/subshard/subshard states go from
> active/construction/construction to inactive/active/active. Please see
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L586-L588
> # Due to SOLR-12509, there's now a cleanup/rollback method called 
> "cleanupAfterFailure" in the finally block that resets the state to 
> active/construction/construction. Please see: 
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L657
> # When step 2 is triggered by a failure in step 1, any documents that were
> indexed into the subshards (which are already active by then) are lost once
> the parent becomes active again.
> If my understanding above is correct, I am wondering:
> # Why is a commit to the parent shard needed *after* the parent shard has
> become inactive, the subshards have become active and the split operation has
> completed?
> # This rollback looks very suspicious. If the subshards are already active
> and the parent is inactive, why set them back to construction? It seems a
> crucial check is missing there. Also, why do we reset the subshard state to
> construction instead of inactive? It is extremely misleading (and, frankly,
> ridiculous) for any external clusterstate monitoring tool to see the subshards
> go from CONSTRUCTION to ACTIVE to CONSTRUCTION and then disappear.
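
The failure sequence in steps 1 to 3, and the missing check asked about in question 2, can be modeled with a small Java sketch. It is a hypothetical, heavily simplified model and not Solr's actual SplitShardCmd code: the SplitRollbackSketch class, the Shard type, commitParent() and runSplit(guardedRollback) are all invented for illustration. It forces the parent commit that follows the state change to fail, then compares a cleanup that always resets the states and removes the subshards with one that skips the reset once the state switch has happened. Only documents indexed after the switch are counted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the SPLITSHARD sequence described above.
// None of these names are Solr APIs; the real logic lives in SplitShardCmd.
class SplitRollbackSketch {

    enum State { ACTIVE, CONSTRUCTION, INACTIVE }

    static final class Shard {
        final List<String> docs = new ArrayList<>();
        State state;

        Shard(State state) {
            this.state = state;
        }
    }

    // Stand-in for the commit on the parent shard; always fails in this sketch.
    static void commitParent() {
        throw new IllegalStateException("commit on parent failed");
    }

    // Runs one split whose post-state-change parent commit fails.
    // Returns how many documents remain in ACTIVE shards afterwards.
    static int runSplit(boolean guardedRollback) {
        Shard parent = new Shard(State.ACTIVE);
        Shard sub0 = new Shard(State.CONSTRUCTION);
        Shard sub1 = new Shard(State.CONSTRUCTION);
        List<Shard> shards = new ArrayList<>(List.of(parent, sub0, sub1));
        boolean stateSwitched = false;
        boolean success = false;
        try {
            // Step 1: parent becomes INACTIVE, subshards become ACTIVE.
            parent.state = State.INACTIVE;
            sub0.state = State.ACTIVE;
            sub1.state = State.ACTIVE;
            stateSwitched = true;
            // The subshards are now active, so new updates are routed to them.
            sub0.docs.add("doc-A");
            sub1.docs.add("doc-B");
            // Commit on the parent *after* the state change (fails here).
            commitParent();
            success = true;
        } catch (IllegalStateException e) {
            // Failure is swallowed; only the cleanup below matters for the sketch.
        } finally {
            // Step 2: cleanup in the finally block.
            boolean skipRollback = guardedRollback && stateSwitched;
            if (!success && !skipRollback) {
                parent.state = State.ACTIVE;
                sub0.state = State.CONSTRUCTION;
                sub1.state = State.CONSTRUCTION;
                // Step 3: the subshards are removed, taking their documents with them.
                shards.remove(sub0);
                shards.remove(sub1);
            }
        }
        int visible = 0;
        for (Shard s : shards) {
            if (s.state == State.ACTIVE) {
                visible += s.docs.size();
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        System.out.println("unguarded rollback, docs visible: " + runSplit(false));
        System.out.println("guarded rollback, docs visible: " + runSplit(true));
    }
}
```

Running main prints 0 for the unguarded cleanup and 2 for the guarded one. The guard here simply skips the reset once the state switch has happened; this message does not say whether the attached patch does exactly that.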



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

