[
https://issues.apache.org/jira/browse/SOLR-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Erick Erickson updated SOLR-10562:
----------------------------------
Attachment: debug.patch
runcli_12.log
2_2.res
2_1.res
1_2.res
1_1.res
BTW, assigned to myself so I don't lose track of it, but I probably won't work
on it for a while so please feel free to take it.
Anyway, the *.res files come from runcli_12.log and show the sequence in
updatehandler for the respective cores.
runcli_12.log is the entire log (look for EOE). You can see that the commit is
called (and returns) before the query is made. It even looks like we're waiting
for the searcher to open, and every shard appears to have its commit called
before we issue the query. So I'm pretty baffled.
I do wonder how many of our sporadic test failures are a result of this
problem. Certainly anything that fails because of unexpected document counts is
suspect.
You can work around this with the logic I put in the test, where we loop for,
say, 10 seconds or until the doc counts are what we expect, but that's yucky.
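For reference, here is a minimal sketch of that kind of polling workaround. It is
not the code from debug.patch; the class and method names (CommitVisibilityPoll,
pollForCount) are made up for illustration, and the query is abstracted behind a
LongSupplier so the sketch runs without a cluster. In the real test the supplier
would presumably wrap cloudClient.query(...).getResults().getNumFound().

```java
import java.util.function.LongSupplier;

public class CommitVisibilityPoll {

    /**
     * Re-runs the count query until it reports the expected number of docs
     * or the timeout elapses. Returns the last count observed, so the caller
     * can still fail the test with a useful message.
     */
    public static long pollForCount(LongSupplier countQuery, long expected,
                                    long timeoutMs, long sleepMs)
            throws InterruptedException {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMs * 1_000_000L;
        long found = countQuery.getAsLong();
        while (found != expected && System.nanoTime() - start < timeoutNanos) {
            Thread.sleep(sleepMs);
            found = countQuery.getAsLong();
        }
        return found;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for the Solr query: reports 7, then 9, then 10
        // docs, mimicking a searcher that becomes visible late. A real
        // supplier would have to wrap the checked exceptions thrown by
        // cloudClient.query(...).
        long[] simulated = {7, 9, 10};
        int[] calls = {0};
        LongSupplier fakeQuery =
            () -> simulated[Math.min(calls[0]++, simulated.length - 1)];

        long found = pollForCount(fakeQuery, 10, 10_000, 10);
        System.out.println("found=" + found + " after " + calls[0] + " queries");
    }
}
```

This only masks the symptom: the test passes once the docs show up, but it says
nothing about why commit returned before they were searchable.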
The patch is on trunk but against sha 1b81dcde (the one before SOLR-10007).
> CloudSolrClient.commit can return before docs are searchable.
> -------------------------------------------------------------
>
> Key: SOLR-10562
> URL: https://issues.apache.org/jira/browse/SOLR-10562
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Affects Versions: 6.6, master (7.0)
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Attachments: 1_1.res, 1_2.res, 2_1.res, 2_2.res, debug.patch,
> runcli_12.log
>
>
> I've been beating the heck out of some test cases for fear that
> SOLR-10007 really messed things up and I can get a pretty regular test
> failure for TestSolrCLIRunExample.testInteractiveSolrCloudExample, but
> it doesn't make sense.
> So I went back to a revision _before_ SOLR-10007 and it still fails.
> But the failure is "impossible". I put in a bunch of log.error messages
> and, for experimental purposes, a for loop in the test. Here are the
> lines that fail in the original:
> {code}
> for (int idx = 0; idx < 10; ++idx) {
>   // construct a SolrInputDocument (doc), then:
>   cloudClient.add(doc);
> }
> cloudClient.commit();
> QueryResponse qr = cloudClient.query(new SolrQuery("str_s:a"));
> if (qr.getResults().getNumFound() != numDocs) {
> fail("Expected "+numDocs+" to be found in the "+collectionName+
> " collection but only found "+qr.getResults().getNumFound());
> }
> {code}
> If I put the above (not the commit, just the query and the test) in a
> loop that checks the query up to 10 times, sleeping 1 second whenever
> numDocs != getNumFound(), then quite regularly I'll see a message in the
> log file that getNumFound() != numDocs, but after a few loops
> getNumFound() == numDocs and the test succeeds.
> cloudClient is what you'd expect:
> cloudClient =
> getCloudSolrClient(executor.solrCloudCluster.getZkServer().getZkAddress());
> So unless I'm hallucinating, any tests that rely on
> cloudClient.commit() ensuring that all docs sent to the cluster are
> searchable will potentially fail on occasion.
> I looked over the JIRAs briefly and don't see any mentions of a
> similar problem, but I may have missed it.
> The logging I'm writing from the update handler _seems_ to show it to be
> doing the right thing. Just late.
> I'll attach some data along with a "patch" which generates the logging
> information. I also attempted to submit a single batch rather than 10
> individual docs and that fails too.