[ 
https://issues.apache.org/jira/browse/SOLR-13945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978536#comment-16978536
 ] 

Ishan Chattopadhyaya commented on SOLR-13945:
---------------------------------------------

Thanks for your patch. +1 to Yonik's suggestion. We really need that for this 
500-line method, which is so hard to grasp!

bq. Please try the patch that I attached just now.
I think we should add a defensive check at the beginning of your cleanup 
method, along the lines of: if both subshards are active, don't do anything 
and bail out.
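
A rough sketch of what such a guard could look like. This is only an illustration: the class, enum and method names below are hypothetical stand-ins and are not taken from SplitShardCmd or from the attached patch.

```java
import java.util.List;

public class CleanupGuardSketch {
    // Hypothetical stand-in for the slice states discussed in this issue;
    // not Solr's actual state enum.
    public enum State { ACTIVE, INACTIVE, CONSTRUCTION }

    /**
     * Returns true if the rollback may proceed. If every sub-shard is already
     * ACTIVE, the split has effectively completed and resetting state could
     * discard documents already indexed into the sub-shards, so we bail out.
     */
    public static boolean shouldRollBack(List<State> subShardStates) {
        boolean allActive = !subShardStates.isEmpty()
                && subShardStates.stream().allMatch(s -> s == State.ACTIVE);
        return !allActive;
    }

    public static void main(String[] args) {
        // Both sub-shards active: the cleanup should do nothing.
        System.out.println(shouldRollBack(List.of(State.ACTIVE, State.ACTIVE)));
        // Sub-shards still under construction: the cleanup may proceed.
        System.out.println(shouldRollBack(List.of(State.CONSTRUCTION, State.CONSTRUCTION)));
    }
}
```

In the real cleanup method the states would presumably be read from the current cluster state rather than passed in as a list; that lookup is left out here.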

{code}
+      if (repFactor > 1) {
+        t = timings.sub("finalCommit");
+        ocmh.commit(results, slice.get(), parentShardLeader);
+        t.stop();
+      }
{code}
Also, can you please add a comment on why repFactor > 1 needs special handling 
here?

> SPLITSHARD data loss due to "rollback"
> --------------------------------------
>
>                 Key: SOLR-13945
>                 URL: https://issues.apache.org/jira/browse/SOLR-13945
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Ishan Chattopadhyaya
>            Priority: Major
>         Attachments: SOLR-13945.patch, SOLR-13945.patch, SOLR-13945.patch
>
>
> # As per SOLR-7673, there is a commit on the parent shard *after state 
> changes* have happened, i.e. from active/construction/construction to 
> inactive/active/active. Please see 
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L586-L588
> # Due to SOLR-12509, there's now a cleanup/rollback method called 
> "cleanupAfterFailure" in the finally block that resets the state to 
> active/construction/construction. Please see: 
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L657
> # When step 2 is triggered by a failure in step 1, any documents that went 
> into the subshards (which are already active by then) are lost once the 
> parent becomes active again.
> If my above understanding is correct, I am wondering:
> # Why is a commit to parent shard needed *after* the parent shard is 
> inactive, subshards are now active and the split operation has completed?
> # This rollback looks very suspicious. If state of subshards is already 
> active and parent is inactive, then what is the need for setting them back to 
> construction? Seems like a crucial check is missing there. Also, why do we 
> reset the subshard status back to construction instead of inactive? It is 
> extremely misleading (and, frankly, ridiculous) for any external clusterstate 
> monitoring tools to see the subshards go from CONSTRUCTION to ACTIVE to 
> CONSTRUCTION and then disappear.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
