[ 
https://issues.apache.org/jira/browse/SOLR-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982059#comment-15982059
 ] 

Shalin Shekhar Mangar commented on SOLR-10562:
----------------------------------------------

This broke my mental model of SolrCloud and all the Jepsen tests that I did, so 
I had to dig deeper.

We see this behavior in this test because SolrCLI updates the 
autoSoftCommit settings using the Config API when running Solr in cloud mode. 
The Config API internally reloads all the replicas belonging to the collection 
asynchronously. This reload then races with the indexing and the commit, as 
the test log that Erick attached shows.

Here is the sequence of events on testCloudExamplePrompt_shard1_replica2:
# T24126 Doc0 is added to IW
# T24388 Core is being reloaded -- this closes the old IW and opens a new one 
thereby committing doc0 to disk
    ## Other docs keep getting added to IW's buffer (remember IW is shared 
between old and new core, reload is still in progress, so all requests are 
going to old core)
    ## T24445 Indexing finishes, client calls hard commit on old core!
    ## T24693 new reloaded core opens Searcher@59ceb956 using the IW -- this 
should have all the data -- note that hard commit is still in progress
    ## T24703 Searcher@59ceb956 is registered with new core
    ## New core is fully loaded and active -- will take part in all future 
requests
# Old core opens searcher Searcher@6e0502e7 which goes to disk directly 
bypassing IW buffer
    ## T24766 Searcher@6e0502e7 is registered with old core
    ## T24767 commit finishes, and old core is closed because this commit 
request was the last one holding the old core's reference!
# All search requests keep going to testCloudExamplePrompt_shard1_replica2 
which has a reloaded core with old data (i.e. 1 doc only)

Finally a search request hits a different replica 
testCloudExamplePrompt_shard1_replica1 and gets the latest count of the 
documents.
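
The sequence above can be modeled roughly with the toy sketch below. It is not 
Solr code: ReloadRaceSketch, Core and simulate are hypothetical names, and a 
"searcher" is reduced to the document count it can see. The log does not show 
what Searcher@59ceb956 actually saw, so the sketch assumes the new core ends up 
with a view holding only what was on disk at reload time (doc0), which matches 
the 1 doc that queries returned.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these classes exist in Solr.
final class ReloadRaceSketch {

    /** A core answers queries from the last searcher registered with it. */
    static final class Core {
        private int registeredDocCount;
        void registerSearcher(int docCount) { registeredDocCount = docCount; }
        int numFound() { return registeredDocCount; }
    }

    /** Returns {numFound after the first commit, numFound after a second commit}. */
    static int[] simulate(int totalDocs) {
        List<String> index = new ArrayList<>();  // everything written so far
        Core oldCore = new Core();
        Core newCore = new Core();

        index.add("doc0");                       // doc0 is added
        int onDiskAtReload = index.size();       // reload starts, doc0 is committed to disk

        for (int i = 1; i < totalDocs; i++) {    // remaining docs arrive via the old core
            index.add("doc" + i);
        }

        // ASSUMPTION: the new core's searcher reflects only the reload-time state.
        newCore.registerSearcher(onDiskAtReload);

        // The client's hard commit is handled by the old core, so the fresh
        // searcher is registered there, and the old core is then closed.
        oldCore.registerSearcher(index.size());

        int afterFirstCommit = newCore.numFound();  // queries now go to the new core

        // Any later commit is handled by the new core and refreshes its searcher.
        newCore.registerSearcher(index.size());
        int afterSecondCommit = newCore.numFound();

        return new int[] {afterFirstCommit, afterSecondCommit};
    }

    public static void main(String[] args) {
        int[] r = simulate(10);
        System.out.println("after first commit: " + r[0]
                + ", after second commit: " + r[1]);
    }
}
```

The point of the model is only that the commit refreshed the searcher of a core 
that was about to be closed, while the core that keeps serving queries was left 
with its older view until the next commit.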

*TL;DR*
Indexing during reloads can cause apparent data loss -- the data exists on disk 
but may not be searchable if a commit starts before a core reload (triggered 
here by a configuration change) completes. In such a state, an RTG will show 
the latest data and any subsequent commit will make the data visible to 
searchers.
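
Since a subsequent commit makes the data visible, a test could tolerate this 
race by re-issuing the commit and re-checking the count. The sketch below is 
one possible shape for that, not something from the patch: CommitRetrySketch 
and commitUntilVisible are hypothetical names, and the commit and query are 
passed in as plain callbacks so the example runs without a Solr cluster.

```java
import java.util.function.LongSupplier;

// Hypothetical test helper; not part of Solr or SolrJ.
final class CommitRetrySketch {

    /**
     * Commits, then checks the visible count; repeats until the expected
     * count is seen or maxAttempts is exhausted. Returns true on success.
     */
    static boolean commitUntilVisible(Runnable commit, LongSupplier numFound,
                                      long expected, int maxAttempts,
                                      long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            commit.run();
            if (numFound.getAsLong() == expected) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve the interrupt
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Fake backend: the first commit has no visible effect (it went to
        // the old core); the second one exposes all 10 docs.
        long[] visible = {1};
        int[] commits = {0};
        boolean ok = commitUntilVisible(
                () -> { if (++commits[0] >= 2) visible[0] = 10; },
                () -> visible[0],
                10, 5, 10);
        System.out.println("visible=" + ok + " after " + commits[0] + " commits");
    }
}
```

In a real test the two callbacks would wrap cloudClient.commit() and the 
getNumFound() of a query, with their checked exceptions handled inside the 
lambdas.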

> CloudSolrClient.commit can return before docs are searchable.
> -------------------------------------------------------------
>
>                 Key: SOLR-10562
>                 URL: https://issues.apache.org/jira/browse/SOLR-10562
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 6.6, master (7.0)
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>         Attachments: 1_1.res, 1_2.res, 2_1.res, 2_2.res, debug.patch, 
> runcli_12.log
>
>
> I've been beating the heck out of some test cases for fear that
> SOLR-10007 really messed things up and I can get a pretty regular test
> failure for TestSolrCLIRunExample.testInteractiveSolrCloudExample, but
> it doesn't make sense.
> So I went back to a revision _before_ SOLR-10007 and it still fails.
> But the failure is "impossible". I put a bunch of log.error messages
> in and, for experimental purposes, a for loop in the test. Here are the
> lines that fail in the original:
> {code}
> for (int idx = 0; idx < 10; ++idx) {
>   // construct a SolrInputDocument "doc" (fields elided here), then:
>   cloudClient.add(doc);
> }
> cloudClient.commit();
> QueryResponse qr = cloudClient.query(new SolrQuery("str_s:a"));
> if (qr.getResults().getNumFound() != numDocs) {
>   fail("Expected "+numDocs+" to be found in the "+collectionName+
>       " collection but only found "+qr.getResults().getNumFound());
> }
> {code}
> If I put the above (not the commit, just the query and the test) in a
> loop and check the query up to 10 times, sleeping 1 second whenever
> numDocs != getNumFound(), quite regularly I'll see a message in the log
> file that getNumFound() != numDocs, but after a few loops getNumFound()
> == numDocs and the test succeeds.
> cloudClient is what you'd expect:
> cloudClient = 
> getCloudSolrClient(executor.solrCloudCluster.getZkServer().getZkAddress());
> So unless I'm hallucinating, any tests that rely on
> cloudClient.commit() ensuring that all docs sent to the cluster are
> searchable will potentially fail on occasion.
> I looked over the JIRAs briefly and don't see any mentions of a
> similar problem, but I may have missed it.
> The logging I'm writing from the update handler _seems_ to show it to be 
> doing the right thing. Just late.
> I'll attach some data along with a "patch" which generates the logging 
> information. I also attempted to submit a single batch rather than 10 
> individual docs and that fails too.


