[ 
https://issues.apache.org/jira/browse/SOLR-11035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-11035:
----------------------------
    Attachment: SOLR-11035.patch
                log.txt


This issue is the root cause of some suspicious (mostly cloud based) test 
failures that -- on the surface -- may initially seem like they are either 
'impossible' or that the commit requests done by the tests are returning before 
the newSearcher was opened (even though {{waitSearcher=true}}).

In my personal experience, the crux of the symptoms are:
* Documents missing from searches done immediately after they were 
(successfully) added+commited
* Seeds don't reproduce reliably, but fail more commonly when running many 
tests and/or CPU usage on the test machine is high.

An example of this class of test failure that initially caught my eye, and I 
could typically get to reproduce semi-frequently is 
DocValuesNotIndexedTest.testGroupingDVOnly.  Running only that test method -- 
or even that entire class -- never failed for me, but if i ran {{ant test 
-Dtests.class=\*DocValues\*}} I could usually get a consistent failure within 
~15 attempts.

One thing most of these failing tests have in common is that instead of a 
static schema, they use managed-schema and either use the Schema API to 
explicitly create the fields needed, or let the schemaless update processors 
add fields as needed.

In comparing the logs of "successful" runs vs "failure" runs, the thing that 
jumped out at me is that the failure runs contain more SolrCore "reloads" (due 
to SOLR-11034) and in every failure case i saw the "CLOSING" log message of a 
SolrCore was occuring _after_ the document adds & commit had been logged by 
that core, but _before_ the log messages from the searches that failed 
validation (meaning they were processed by the newly (re)loaded core)

As Shalin notes in SOLR-10562 (where the Congif API is used causing a core 
reload) in the process of closing the "old" core and opening the "new" core, 
there is a race condition where the "old" core will accept/log/write 
documents/commits to the index, but these will not be immediately visible to 
the SolrIndexSearcher that the "new" core is using once the hand-off is made.

I wrote the attached test class to try and demonstrate this problem in it's 
simplest form (independent of ManagedIndexSchema or SolrCloud or Config API) 
using concurrent 2 threads: one to do an "add, commit(waitSearcher), query" and 
the second to trigger a core reload.

In addition to reproducing the failure I expected (where the query result don't 
reflect the last add+commit) i was able to trigger the second type of failure 
mentioned above: queries can be routed to the "old" core after/during it's 
shutdown, causing a solr exception because the (old) core refuses to open a new 
searcher.

I tweaked my test as originally written to include a HACK making it possible to 
demonstrate both types of failures in a single run -- see the attached log for 
full details....

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestReloadConcurrentAdds -Dtests.method=testConcurrent 
-Dtests.seed=5A728C1E8654399 -Dtests.slow=true -Dtests.locale=es-PA 
-Dtests.timezone=Atlantic/Faroe -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.57s | TestReloadConcurrentAdds.testConcurrent <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: Exception during 
query
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([5A728C1E8654399:25D4A7DB2FCE2B30]:0)
   [junit4]    >        at 
org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:878)
   [junit4]    >        at 
org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:845)
   [junit4]    >        at 
org.apache.solr.core.TestReloadConcurrentAdds.testConcurrent(TestReloadConcurrentAdds.java:103)
   [junit4]    >        at 
org.apache.solr.core.TestReloadConcurrentAdds.testConcurrent(TestReloadConcurrentAdds.java:49)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
   [junit4]    > Caused by: org.apache.solr.common.SolrException: 
openNewSearcher called on closed core
{noformat}

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestReloadConcurrentAdds 
-Dtests.method=testConcurrentWithNewSearcherExceptionHack 
-Dtests.seed=5A728C1E8654399 -Dtests.slow=true -Dtests.locale=es-PA 
-Dtests.timezone=Atlantic/Faroe -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.81s | 
TestReloadConcurrentAdds.testConcurrentWithNewSearcherExceptionHack <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: Exception during 
query
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([5A728C1E8654399:1C989F4D65319F55]:0)
   [junit4]    >        at 
org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:878)
   [junit4]    >        at 
org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:845)
   [junit4]    >        at 
org.apache.solr.core.TestReloadConcurrentAdds.testConcurrent(TestReloadConcurrentAdds.java:103)
   [junit4]    >        at 
org.apache.solr.core.TestReloadConcurrentAdds.testConcurrentWithNewSearcherExceptionHack(TestReloadConcurrentAdds.java:55)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
   [junit4]    > Caused by: java.lang.RuntimeException: REQUEST FAILED: 
xpath=//result[@numFound='33']
   [junit4]    >        xml response was: <?xml version="1.0" encoding="UTF-8"?>
   [junit4]    > <response>
   [junit4]    > <lst name="responseHeader">
   [junit4]    >   <int name="status">0</int>
   [junit4]    >   <int name="QTime">0</int>
   [junit4]    >   <lst name="params">
   [junit4]    >     <str name="q">*:*</str>
   [junit4]    >     <str name="rows">0</str>
   [junit4]    >     <str name="wt">xml</str>
   [junit4]    >   </lst>
   [junit4]    > </lst>
   [junit4]    > <result name="response" numFound="32" start="0">
{noformat}


> (at least) 2 distinct failures possible when clients attempt searches during 
> SolrCore reload
> --------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11035
>                 URL: https://issues.apache.org/jira/browse/SOLR-11035
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>         Attachments: log.txt, SOLR-11035.patch
>
>
> If a SolrCore is reloaded, there are (at least) 2 distinct types of failures 
> that clients may observe when executing updates + queries while the reload is 
> in progress...
> * documents may appear missing during queries
> * queries may fail with "SolrException: openNewSearcher called on closed core"
> Details to follow in comment...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to