[ 
https://issues.apache.org/jira/browse/SOLR-9366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-9366:
---------------------------
    Attachment: SOLR-9366.txt.gz


sarowe's jenkins identified that this test fails fairly reliably on master 
(circa d86c369533)...

{noformat}
ant test  -Dtestcase=TestStressCloudBlindAtomicUpdates  
-Dtests.seed=60C9AAFC94FD8921 -Dtests.multiplier=2 -Dtests.nightly=true 
-Dtests.slow=true 
-Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt 
-Dtests.locale=id -Dtests.timezone=America/Puerto_Rico -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
{noformat}

NOTE: {{test_dv_stored}} is the specific test method that fails, but including 
{{-Dtest.method}} in the reproduce line won't work due to an unrelated bug in 
TestInjection (jira link to follow once i get a chance to file it)

----

The attached (gzipped) SOLR-9366.txt.gz log file is from a run of that seed on 
my computer, pruned down to only the output from test_dv_stored.

Note that MinSolrCluster collection has 5 shards with a repFactor of 2.  the 
test indexes 332 docs, and during the process encounters many errors caused by 
TestInjection -- ultimately leading to all non-leader replicas to be in 
recovery.

After all 332 docs are added, the test issues an explicit commit, which all of 
the replicas in recovery log that they are explicitly ignoring...

{noformat}
   [junit4]   2> 96495 INFO  (qtp1014592735-82) [n:127.0.0.1:53326_solr 
c:test_col s:shard1 r:core_node9 x:test_col_shard1_replica2] 
o.a.s.u.p.DistributedU
pdateProcessor Ignoring commit while not ACTIVE - state: BUFFERING replay: false

   [junit4]   2> 96497 INFO  (qtp463019022-84) [n:127.0.0.1:51890_solr 
c:test_col s:shard2 r:core_node4 x:test_col_shard2_replica1] 
o.a.s.u.p.DistributedUpdateProcessor Ignoring commit while not ACTIVE - state: 
BUFFERING replay: false

   [junit4]   2> 96503 INFO  (qtp574107834-44) [n:127.0.0.1:45722_solr 
c:test_col s:shard4 r:core_node10 x:test_col_shard4_replica2] 
o.a.s.u.p.DistributedUpdateProcessor Ignoring commit while not ACTIVE - state: 
BUFFERING replay: false

   [junit4]   2> 96504 INFO  (qtp1497478396-522) [n:127.0.0.1:60681_solr 
c:test_col s:shard3 r:core_node3 x:test_col_shard3_replica1] 
o.a.s.u.p.DistributedUpdateProcessor Ignoring commit while not ACTIVE - state: 
BUFFERING replay: false

   [junit4]   2> 96510 INFO  (qtp1947867562-490) [n:127.0.0.1:35941_solr 
c:test_col s:shard5 r:core_node6 x:test_col_shard5_replica1] 
o.a.s.u.p.DistributedUpdateProcessor Ignoring commit while not ACTIVE - state: 
BUFFERING replay: false
{noformat}

...meanwhile the 5 leaders are all committing & opening new searchers...

{noformat}

   [junit4]   2> 96503 INFO  
(searcherExecutor-70-thread-1-processing-n:127.0.0.1:50832_solr 
x:test_col_shard1_replica1 s:shard1 c:test_col r:core_node2) 
[n:127.0.0.1:50832_solr c:test_col s:shard1 r:core_node2 
x:test_col_shard1_replica1] o.a.s.c.SolrCore [test_col_shard1_replica1] 
Registered new searcher Searcher@641bc113[test_col_shard1_replica1] 
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_y(7.0.0):C65)))}

   [junit4]   2> 96508 INFO  
(searcherExecutor-71-thread-1-processing-n:127.0.0.1:43905_solr 
x:test_col_shard2_replica2 s:shard2 c:test_col r:core_node8) 
[n:127.0.0.1:43905_solr c:test_col s:shard2 r:core_node8 
x:test_col_shard2_replica2] o.a.s.c.SolrCore [test_col_shard2_replica2] 
Registered new searcher Searcher@2fd5ffdc[test_col_shard2_replica2] 
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_v(7.0.0):C65)))}

   [junit4]   2> 96511 INFO  
(searcherExecutor-68-thread-1-processing-n:127.0.0.1:56607_solr 
x:test_col_shard3_replica2 s:shard3 c:test_col r:core_node5) 
[n:127.0.0.1:56607_solr c:test_col s:shard3 r:core_node5 
x:test_col_shard3_replica2] o.a.s.c.SolrCore [test_col_shard3_replica2] 
Registered new searcher Searcher@29273482[test_col_shard3_replica2] 
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_15(7.0.0):C75)))}

   [junit4]   2> 96513 INFO  
(searcherExecutor-69-thread-1-processing-n:127.0.0.1:36018_solr 
x:test_col_shard5_replica2 s:shard5 c:test_col r:core_node1) 
[n:127.0.0.1:36018_solr c:test_col s:shard5 r:core_node1 
x:test_col_shard5_replica2] o.a.s.c.SolrCore [test_col_shard5_replica2] 
Registered new searcher Searcher@126ca96c[test_col_shard5_replica2] 
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_w(7.0.0):C68)))}

   [junit4]   2> 96515 INFO  
(searcherExecutor-72-thread-1-processing-n:127.0.0.1:45414_solr 
x:test_col_shard4_replica1 s:shard4 c:test_col r:core_node7) 
[n:127.0.0.1:45414_solr c:test_col s:shard4 r:core_node7 
x:test_col_shard4_replica1] o.a.s.c.SolrCore [test_col_shard4_replica1] 
Registered new searcher Searcher@5204d7bc[test_col_shard4_replica1] 
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_q(7.0.0):C59)))}
{noformat}

Once the test gets a "success" status from the SolrCloudClient in response to 
the commit request, it attempts to do a {{\*:\*}} search to sanity check that 
the index contains the expected docs before proceeding with the actual stuff 
the test cares about -- but this assertion fails, because some of those 
replicas that are in recovery are consulted as part of the search...

{noformat}
   [junit4]   2> 96526 INFO  (qtp1665435692-109) [n:127.0.0.1:56607_solr 
c:test_col s:shard3 r:core_node5 x:test_col_shard3_replica2] o.a.s.c.S.Request 
[test_col_shard3_replica2]  webapp=/solr path=/select 
params={distrib=false&_stateVer_=test_col:9&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.0.1:56607/solr/test_col_shard3_replica2/&rows=10&version=2&q=*:*&NOW=1470079073107&isShard=true&wt=javabin}
 hits=75 status=0 QTime=0

   [junit4]   2> 96527 INFO  (qtp831283616-488) [n:127.0.0.1:43905_solr 
c:test_col s:shard2 r:core_node8 x:test_col_shard2_replica2] o.a.s.c.S.Request 
[test_col_shard2_replica2]  webapp=/solr path=/select 
params={distrib=false&_stateVer_=test_col:9&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.0.1:43905/solr/test_col_shard2_replica2/&rows=10&version=2&q=*:*&NOW=1470079073107&isShard=true&wt=javabin}
 hits=65 status=0 QTime=0
   [junit4]   2> 96530 INFO  (qtp574107834-44) [n:127.0.0.1:45722_solr 
c:test_col s:shard4 r:core_node10 x:test_col_shard4_replica2] o.a.s.c.S.Request 
[test_col_shard4_replica2]  webapp=/solr path=/select 
params={distrib=false&_stateVer_=test_col:9&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.0.1:45414/solr/test_col_shard4_replica1/|http://127.0.0.1:45722/solr/test_col_shard4_replica2/&rows=10&version=2&q=*:*&NOW=1470079073107&isShard=true&wt=javabin}
 hits=0 status=0 QTime=0
   [junit4]   2> 96531 INFO  (qtp287931206-46) [n:127.0.0.1:50832_solr 
c:test_col s:shard1 r:core_node2 x:test_col_shard1_replica1] o.a.s.c.S.Request 
[test_col_shard1_replica1]  webapp=/solr path=/select 
params={distrib=false&_stateVer_=test_col:9&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.0.1:50832/solr/test_col_shard1_replica1/&rows=10&version=2&q=*:*&NOW=1470079073107&isShard=true&wt=javabin}
 hits=65 status=0 QTime=0
   [junit4]   2> 96532 INFO  (qtp1947867562-96) [n:127.0.0.1:35941_solr 
c:test_col s:shard5 r:core_node6 x:test_col_shard5_replica1] o.a.s.c.S.Request 
[test_col_shard5_replica1]  webapp=/solr path=/select 
params={distrib=false&_stateVer_=test_col:9&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.0.1:36018/solr/test_col_shard5_replica2/|http://127.0.0.1:35941/solr/test_col_shard5_replica1/&rows=10&version=2&q=*:*&NOW=1470079073107&isShard=true&wt=javabin}
 hits=0 status=0 QTime=0
   [junit4]   2> 96533 INFO  (qtp831283616-487) [n:127.0.0.1:43905_solr 
c:test_col s:shard2 r:core_node8 x:test_col_shard2_replica2] o.a.s.c.S.Request 
[test_col_shard2_replica2]  webapp=/solr path=/select 
params={q=*:*&distrib=false&_stateVer_=test_col:9&shards.purpose=64&NOW=1470079073107&ids=0,35,13,36,15,4,16,8,41,32&isShard=true&shard.url=http://127.0.0.1:43905/solr/test_col_shard2_replica2/&wt=javabin&version=2}
 status=0 QTime=0
   [junit4]   2> 96534 INFO  (qtp287931206-461) [n:127.0.0.1:50832_solr 
c:test_col s:shard1 r:core_node2 x:test_col_shard1_replica1] o.a.s.c.S.Request 
[test_col_shard1_replica1]  webapp=/solr path=/select 
params={q=*:*&_stateVer_=test_col:9&wt=javabin&version=2} hits=205 status=0 
QTime=13

{noformat}

...and BOOM, the test fails before it ever really starts because 332 docs 
expected != the 205 numFound returned by the search.




> replicas in LIR seem to still be participating in search requests
> -----------------------------------------------------------------
>
>                 Key: SOLR-9366
>                 URL: https://issues.apache.org/jira/browse/SOLR-9366
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>         Attachments: SOLR-9366.txt.gz
>
>
> Spinning this off from SOLR-9363 where sarowe's jenkins encountered a strange 
> test failure when TestInjection is used causing replicas to return errors on 
> some requests.
> Reading over the logs it appears that when this happens, and the replicas are 
> put into LIR they then ignore an explicit user requested commit (ie: 
> {{Ignoring commit while not ACTIVE - state: BUFFERING replay: false}}) but 
> still participate in queries -- apparently before they finish recovery (or at 
> the very least before / with-out doing the commit/openSearcher that they 
> previously ignored.
> Details and logs to follow...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to