[
https://issues.apache.org/jira/browse/SOLR-4629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated SOLR-4629:
---------------------------
Attachment:
SOLR-4629_emptycommittest_and_numfoundrefactor_and_waitparam.patch
After adding a lot of debug logs, walking through the results of lots of
failed tests compared to successful tests, and a lot of vigorous, physical
consultation between my forehead and my desk, I think I've finally tracked down
the cause of all the "expected 2 got 3" failures from checkForSingleIndex.
The problem in a nutshell is one of concurrency. When the test thread makes a
request to the master or to the slave, those requests are handled by a jetty
thread which (via SolrDispatchFilter) creates a SolrQueryRequest, which has a
searcher ref, which has a Directory ref. When the request is done, the
SolrQueryRequest is closed, which releases the searcher ref, which releases the
directory ref -- but by the time this happens, the response has already been
returned to the "client" (the test thread), and the test thread may enter
checkForSingleIndex and acquire the lock on the CachingDirectoryFactory (to check
the list of cached paths) before the resources from a previous request have been
completely released -- so the test fails because a Directory from an old
request still hasn't been released.
Example...
{noformat}
Time  Test-Thread             Jetty-Thread-N
0     http request->jetty
1                             accept http request
2                             create solr query request
3                             incref searcher, incref dir
4                             process solr query request
5                             test thread<-write http response
6     process response
7     ...
8     assert(2=num dirs)
9                             decref searcher, decref dir, release dir
{noformat}
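To make that interleaving concrete, here is a minimal Java sketch that forces the worst-case ordering with latches. It is only a model under stated assumptions: the class, the counter, and the method name are hypothetical stand-ins rather than Solr code, and a plain counter takes the place of the list of cached directories.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the timeline above; none of these names come from Solr.
class ReleaseAfterResponseSketch {
    // Stand-in for "num dirs": starts at the 2 the test expects to see.
    static final AtomicInteger openDirs = new AtomicInteger(2);

    /** Runs one simulated request and returns the count the "test thread" reads at step 8. */
    static int observedAfterResponse() {
        CountDownLatch responseWritten = new CountDownLatch(1);
        CountDownLatch assertionDone = new CountDownLatch(1);

        Thread jetty = new Thread(() -> {
            openDirs.incrementAndGet();      // step 3: incref searcher, incref dir
            responseWritten.countDown();     // step 5: write http response
            try {
                assertionDone.await();       // force step 9 to run after step 8
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            openDirs.decrementAndGet();      // step 9: decref, release dir
        });
        jetty.start();

        try {
            responseWritten.await();         // step 6: process response
            int seen = openDirs.get();       // step 8: value the assert would compare to 2
            assertionDone.countDown();
            jetty.join();
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while simulating request", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("seen at assert time: " + observedAfterResponse());
        System.out.println("after release: " + openDirs.get());
    }
}
```

In the real test the ordering is not forced, which is why the failure is intermittent: most of the time step 9 wins the race and the assert sees 2.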
I think the key change is to modify checkForSingleIndex so that instead of
asserting exactly 2 paths in the cache, we assert that there are only 2 paths
that are not "done" -- allowing for the possibility of other paths still being
tracked because of requests still being closed.
The attached patch makes this change -- there are still some nocommits (in
particular I completely commented out the replication core reloading to rule
that out as a possible cause, but there's also some excessively absurd logging)
but even if you ignore all that, after replacing
"CachingDirectoryFactory.getPaths()" with
"CachingDirectoryFactory.getLivePaths()" I have yet to see "expected:<2> but
was:<3>" in any test run. If you tweak that method to eliminate the
"!val.doneWithDir" check, you should start seeing the failures come back.
I'll clean the patch up more tomorrow and run some more exhaustive tests to be
sure I haven't broken anything, but I wanted to post what I had in case I got
hit by a bus (and to ensure [[email protected]] doesn't see any flaw with
my "getLivePaths()" change before I get too happy about it)
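For anyone who doesn't want to dig through the patch, the idea behind getLivePaths() can be sketched roughly as below. This is a simplified, hypothetical model and not the code in the patch or in CachingDirectoryFactory: the class, the map, the add/markDone helpers, and the example paths are all assumptions made for illustration. Only the getPaths/getLivePaths names and the "!val.doneWithDir" filter come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of a directory cache; not the actual Solr class.
class LivePathsSketch {
    static final class CacheValue {
        final String path;
        boolean doneWithDir; // assumed meaning: the owner is finished with this directory
        CacheValue(String path) { this.path = path; }
    }

    private final Map<String, CacheValue> byPath = new LinkedHashMap<>();

    // Hypothetical helpers used only to set up the example.
    synchronized void add(String path) { byPath.put(path, new CacheValue(path)); }
    synchronized void markDone(String path) { byPath.get(path).doneWithDir = true; }

    /** Every tracked path, including ones only waiting on a final release. */
    synchronized Set<String> getPaths() {
        return new LinkedHashSet<>(byPath.keySet());
    }

    /** Only the paths whose directory has not been marked done. */
    synchronized Set<String> getLivePaths() {
        Set<String> live = new LinkedHashSet<>();
        for (CacheValue val : byPath.values()) {
            if (!val.doneWithDir) {
                live.add(val.path);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        LivePathsSketch cache = new LivePathsSketch();
        cache.add("data");            // example paths are made up
        cache.add("data/index");
        cache.add("data/index.old");  // still tracked because a request holds a ref
        cache.markDone("data/index.old");
        System.out.println("getPaths: " + cache.getPaths().size());
        System.out.println("getLivePaths: " + cache.getLivePaths().size());
    }
}
```

With this setup getPaths() reports three entries while getLivePaths() reports two, which is the distinction checkForSingleIndex needs in order to ignore a directory that is only waiting on a slow request to finish closing.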
> Stronger standard replication testing.
> --------------------------------------
>
> Key: SOLR-4629
> URL: https://issues.apache.org/jira/browse/SOLR-4629
> Project: Solr
> Issue Type: Test
> Components: replication (java)
> Reporter: Mark Miller
> Assignee: Mark Miller
> Fix For: 4.3, 5.0, 4.2.1
>
> Attachments:
> SOLR-4629_emptycommittest_and_numfoundrefactor_and_waitparam.patch,
> SOLR-4629_emptycommittest_and_numfoundrefactor_and_waitparam.patch,
> SOLR-4629_emptycommittest_and_numfoundrefactor_and_waitparam.patch,
> SOLR-4629_emptycommittest_and_numfoundrefactor_and_waitparam.patch
>
>
> I added to these tests recently, but there is a report on the list indicating
> we may still be missing something. Most reports have been positive so far
> after the 4.2 fixes, but I'd feel better after adding some more testing.