[ https://issues.apache.org/jira/browse/SOLR-13945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978371#comment-16978371 ]
Andrzej Bialecki commented on SOLR-13945:
-----------------------------------------

Did you run this scenario on a collection with replicationFactor==1? If not, then I think there must be something else going on here.

The call to {{commit}} assumes the parent shard is still ACTIVE, and it should be when repFactor > 1, because in this case the parent/sub shard states are switched to inactive/active not in SplitShardCmd but in ReplicaMutator, and then only *after* all replicas in the new sub-shards finished recovering and are ACTIVE.

This doesn't seem to be the case when repFactor == 1, because that's when the parent/sub shard states are switched immediately, and then indeed the call to commit would produce an error.

This is something that we can easily fix by calling the commit earlier, or not treating this situation as a special case.

This code section goes as far back as branch_5x (!) and I can't say I fully understand the need for this commit here - I suspect it has something to do with buffering updates and tlog replays between parent and sub-shards, so removing it may result in other subtle data loss.

> SPLITSHARD data loss due to "rollback"
> --------------------------------------
>
>                 Key: SOLR-13945
>                 URL: https://issues.apache.org/jira/browse/SOLR-13945
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Ishan Chattopadhyaya
>            Priority: Major
>         Attachments: SOLR-13945.patch, SOLR-13945.patch
>
>
> # As per SOLR-7673, there is a commit on the parent shard *after state changes* have happened, i.e. from active/construction/construction to inactive/active/active. Please see https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L586-L588
> # Due to SOLR-12509, there's now a cleanup/rollback method called "cleanupAfterFailure" in the finally block that resets the state to active/construction/construction. Please see: https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L657
> # When 2 is entered into due to a failure in 1, we have a situation where any documents that went into the subshards (because they are already active by now) are now lost after the parent becomes active.
>
> If my above understanding is correct, I am wondering:
> # Why is a commit to parent shard needed *after* the parent shard is inactive, subshards are now active and the split operation has completed?
> # This rollback looks very suspicious. If state of subshards is already active and parent is inactive, then what is the need for setting them back to construction? Seems like a crucial check is missing there. Also, why do we reset the subshard status back to construction instead of inactive? It is extremely misleading (and, frankly, ridiculous) for any external clusterstate monitoring tools to see the subshards to go from CONSTRUCTION to ACTIVE to CONSTRUCTION and then the subshard disappearing.
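
For illustration, here is a minimal sketch of the kind of guard the issue description is asking about, assuming a simplified three-state model (ACTIVE, INACTIVE, CONSTRUCTION) based only on the state names used in this thread. The class `SplitStateSketch`, the enum `ShardState`, and both methods are hypothetical and are not part of Solr's SplitShardCmd or ReplicaMutator. The sketch only shows a rollback that refuses to reset states once the parent/sub-shard switch has already happened.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; this is NOT Solr code.
final class SplitStateSketch {

    // Simplified states, named after the ones mentioned in the thread.
    enum ShardState { ACTIVE, INACTIVE, CONSTRUCTION }

    /**
     * Assumed rule: a rollback is only safe while the parent is still ACTIVE
     * and no sub-shard is ACTIVE, i.e. before the state switch took place.
     */
    static boolean canRollBack(ShardState parent, List<ShardState> subShards) {
        if (parent != ShardState.ACTIVE) {
            return false;
        }
        for (ShardState s : subShards) {
            if (s == ShardState.ACTIVE) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns the states after a failure: element 0 is the parent, the rest
     * are the sub-shards. When the rollback is not safe, states are left
     * unchanged so documents already routed to ACTIVE sub-shards are kept.
     */
    static List<ShardState> statesAfterFailure(ShardState parent, List<ShardState> subShards) {
        List<ShardState> result = new ArrayList<>();
        if (canRollBack(parent, subShards)) {
            // Reset as described in the issue: active/construction/construction.
            result.add(ShardState.ACTIVE);
            for (int i = 0; i < subShards.size(); i++) {
                result.add(ShardState.CONSTRUCTION);
            }
        } else {
            result.add(parent);
            result.addAll(subShards);
        }
        return result;
    }

    public static void main(String[] args) {
        // Failure after the switch (inactive/active/active): no reset.
        System.out.println(statesAfterFailure(
            ShardState.INACTIVE, List.of(ShardState.ACTIVE, ShardState.ACTIVE)));
        // Failure before the switch (active/construction/construction): reset allowed.
        System.out.println(statesAfterFailure(
            ShardState.ACTIVE, List.of(ShardState.CONSTRUCTION, ShardState.CONSTRUCTION)));
    }
}
```

Whether leaving the states unchanged is the right reaction to a late failure is itself an assumption; the comment above suggests that moving the commit earlier may be the simpler fix.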