[ 
https://issues.apache.org/jira/browse/SOLR-13945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978371#comment-16978371
 ] 

Andrzej Bialecki commented on SOLR-13945:
-----------------------------------------

Did you run this scenario on a collection with replicationFactor==1? If not, 
then I think there must be something else going on here.

The call to {{commit}} assumes the parent shard is still ACTIVE, and it should 
be when repFactor > 1, because in that case the parent/sub-shard states are 
switched to inactive/active not in SplitShardCmd but in ReplicaMutator, and 
only *after* all replicas in the new sub-shards have finished recovering and 
are ACTIVE.

This doesn't seem to be the case when repFactor == 1, because then the 
parent/sub-shard states are switched immediately, and the call to commit 
would indeed produce an error. We can easily fix this by calling the commit 
earlier, or by not treating repFactor == 1 as a special case.
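
To make the ordering problem concrete, here is a toy Java model of it. This 
is only a sketch and not Solr code: the class, the {{State}} enum, 
{{ParentShard}} and {{split}} are hypothetical names invented for 
illustration, and the only assumption taken from the discussion above is that 
a commit on the parent fails once the parent is no longer ACTIVE.

```java
public class SplitCommitOrderSketch {

    /** Toy stand-in for the shard states named in this thread. */
    enum State { ACTIVE, CONSTRUCTION, INACTIVE }

    /** Hypothetical parent shard: commit() only succeeds while ACTIVE. */
    static final class ParentShard {
        State state = State.ACTIVE;

        void commit() {
            if (state != State.ACTIVE) {
                throw new IllegalStateException("commit on parent in state " + state);
            }
        }
    }

    /**
     * Simulates the tail end of a split.
     * With repFactor == 1 the parent is switched to INACTIVE immediately;
     * with repFactor > 1 the switch is deferred (not modelled here), so the
     * parent is still ACTIVE when the commit runs.
     *
     * @return true if the commit on the parent succeeded
     */
    static boolean split(int repFactor, boolean commitBeforeSwitch) {
        ParentShard parent = new ParentShard();
        if (commitBeforeSwitch) {
            parent.commit(); // parent is still ACTIVE here
        }
        if (repFactor == 1) {
            parent.state = State.INACTIVE; // immediate state switch
        }
        try {
            if (!commitBeforeSwitch) {
                parent.commit(); // fails if the parent was already switched
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("repFactor=1, commit after switch:  " + split(1, false));
        System.out.println("repFactor=2, commit after switch:  " + split(2, false));
        System.out.println("repFactor=1, commit before switch: " + split(1, true));
    }
}
```

In this model only the first case fails, which matches the suspicion that the 
error is specific to repFactor == 1, and moving the commit ahead of the state 
switch avoids it.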

This code section goes as far back as branch_5x (!) and I can't say I fully 
understand the need for this commit here - I suspect it has something to do 
with buffering updates and tlog replays between parent and sub-shards, so 
removing it may result in other subtle data loss.

> SPLITSHARD data loss due to "rollback"
> --------------------------------------
>
>                 Key: SOLR-13945
>                 URL: https://issues.apache.org/jira/browse/SOLR-13945
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Ishan Chattopadhyaya
>            Priority: Major
>         Attachments: SOLR-13945.patch, SOLR-13945.patch
>
>
> # As per SOLR-7673, there is a commit on the parent shard *after state 
> changes* have happened, i.e. from active/construction/construction to 
> inactive/active/active. Please see 
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L586-L588
> # Due to SOLR-12509, there's now a cleanup/rollback method called 
> "cleanupAfterFailure" in the finally block that resets the state to 
> active/construction/construction. Please see: 
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L657
> # When step 2 is triggered by a failure in step 1, any documents that went 
> into the subshards (because they are already active by now) are lost once 
> the parent becomes active again.
> If my above understanding is correct, I am wondering:
> # Why is a commit to parent shard needed *after* the parent shard is 
> inactive, subshards are now active and the split operation has completed?
> # This rollback looks very suspicious. If state of subshards is already 
> active and parent is inactive, then what is the need for setting them back to 
> construction? Seems like a crucial check is missing there. Also, why do we 
> reset the subshard status back to construction instead of inactive? It is 
> extremely misleading (and, frankly, ridiculous) for any external clusterstate 
> monitoring tools to see the subshards go from CONSTRUCTION to ACTIVE to 
> CONSTRUCTION and then disappear.



