[
https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated SOLR-8914:
---------------------------
Attachment: jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt
The specific assertion that failed had to do with a sanity-checking query for 2
docs that are added to ensure that the compositeId routing prefixes used
throughout the tests actually correspond to the expected shard names -- but
for the purpose of discussing this particular jenkins failure, the key things
to know about this test failure are...
* there are 2 shards w/ replicationFactor=2 which initialized successfully &
are not undergoing recovery
* 2 docs are (successfully) added, with prefixes such that one is located on
each shard
* a hard commit is (successfully) executed from the perspective of the
CloudSolrClient
** {color:red}but only shard1 logs the commit & newSearcher{color}
* a {{\*:\*}} query is executed...
** only 1 doc is found by this query
** an error is logged by a shard1 node that {color:red}no servers hosting
shard: shard2{color}
** {color:red}a shard2 node does in fact log that it was consulted as part of
this query{color}
Which raises 2 key questions I can't make sense of...
* Why didn't either shard2 node ever log (and execute) the commit & opening of
a new searcher?
* Why did a shard1 node log an error that "no servers hosting shard: shard2"
even though shard2's core_node2 clearly received the first phase of the
distributed request & responded with a success?
I've attached the full log file, and provided my detailed notes, line by line
(of both the relevant code & the log file) below - NOTE: all "Lxx" line numbers
refer to GIT SHA c46d768, since that's what jenkins was testing.
----
* TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster...
** begins by creating a 5-node MiniSolrCloudCluster & initializing a collection
with 2 shards and replicationFactor=2
* timestamps #1302667 - #1302668
** TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster @ L129
** call to AbstractDistribZkTestBase.waitForRecoveriesToFinish (a rough setup
sketch follows the log excerpt below)
** we can see from the log that all 4 replicas are good to go...{noformat}
1302667 INFO
(SUITE-TestTolerantUpdateProcessorCloud-seed#[F0FBD4E7705C29B5]-worker) [ ]
o.a.s.c.AbstractDistribZkTestBase Wait for recoveries to finish - collection:
test_col failOnTimeout:true timeout (sec):330
-
replica:core_node1 rstate:active live:true
replica:core_node4 rstate:active live:true
replica:core_node2 rstate:active live:true
replica:core_node3 rstate:active live:true
no one is recoverying
1302668 INFO
(SUITE-TestTolerantUpdateProcessorCloud-seed#[F0FBD4E7705C29B5]-worker) [ ]
o.a.s.c.AbstractDistribZkTestBase Recoveries finished - collection: test_col
{noformat}
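For reference, that setup boils down to roughly the following (a hedged sketch, not the test's literal code; {{baseDir}} and the {{"conf1"}} config name are assumptions, and the exact SolrJ factory methods vary a bit by version):
{code:java}
import java.nio.file.Path;
import org.apache.solr.client.solrj.embedded.JettyConfig;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.cloud.AbstractDistribZkTestBase;
import org.apache.solr.cloud.MiniSolrCloudCluster;

// Rough sketch: 5 Jetty nodes, one collection with 2 shards x replicationFactor=2,
// then block until every replica reports "active" (matching the log above).
MiniSolrCloudCluster cluster = new MiniSolrCloudCluster(
    5, baseDir, JettyConfig.builder().build());   // baseDir: assumed temp dir
CollectionAdminRequest.createCollection("test_col", "conf1", 2, 2)
    .process(cluster.getSolrClient());
AbstractDistribZkTestBase.waitForRecoveriesToFinish(
    "test_col", cluster.getSolrClient().getZkStateReader(),
    /* verbose */ true, /* failOnTimeout */ true, /* timeoutSeconds */ 330);
{code}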
* TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster @ L132 - L183
** the test inspects ZK state to init some HTTP SolrClients pointed at the
URLs for specific leaders & replicas (these are used to test all possible
forwarding situations in the various test methods; see the sketch below)
** timestamp #1302671 shows the ZkStateReader used for this purpose refreshing
its live nodes info (L141)
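Something like this (again a hedged sketch; method & constant names per the 6.x SolrJ API as I recall them, with {{cluster}} from the setup sketch above):
{code:java}
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

// Point an HttpSolrClient directly at one specific core (leader or
// replica) so individual test methods can exercise every forwarding path.
ZkStateReader zkStateReader = cluster.getSolrClient().getZkStateReader();
Slice shard1 = zkStateReader.getClusterState()
    .getCollection("test_col").getSlice("shard1");
Replica leader = shard1.getLeader();
String coreUrl = leader.getStr(ZkStateReader.BASE_URL_PROP)
    + "/" + leader.getStr(ZkStateReader.CORE_NAME_PROP);
HttpSolrClient shard1LeaderClient = new HttpSolrClient(coreUrl); // 6.x-era ctor
{code}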
* TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster @ L184 - L188
** adding 2 docs with specific routing prefixes
** the ultimate goal is to sanity check that the routing prefixes map to the
expected shards, so other parts of the test can assume this (a SolrJ sketch
follows the log excerpt below)
** timestamps #1302696 - #1302719 record these 2 docs being added to both
replicas of shard1 & shard2...{noformat}
1302696 INFO (qtp1904355206-7333) [n:127.0.0.1:34275_solr c:test_col s:shard1
r:core_node4 x:test_col_shard1_replica2] o.a.s.u.p.LogUpdateProcessorFactory
[test_col_shard1_replica2] webapp=/solr path=/update
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:56361/solr/test_col_shard1_replica1/&wt=javabin&version=2}{add=[abc!447049285
(1529857595576156160)]} 0 4
1302697 INFO (qtp1409631559-7341) [n:127.0.0.1:56361_solr c:test_col s:shard1
r:core_node1 x:test_col_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory
[test_col_shard1_replica1] webapp=/solr path=/update
params={wt=javabin&version=2}{add=[abc!447049285 (1529857595576156160)]} 0 14
1302719 INFO (qtp797763926-7371) [n:127.0.0.1:51144_solr c:test_col s:shard2
r:core_node3 x:test_col_shard2_replica1] o.a.s.u.p.LogUpdateProcessorFactory
[test_col_shard2_replica1] webapp=/solr path=/update
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:33616/solr/test_col_shard2_replica2/&wt=javabin&version=2}{add=[XYZ!701047101
(1529857595596079104)]} 0 7
1302719 INFO (qtp316491708-7355) [n:127.0.0.1:33616_solr c:test_col s:shard2
r:core_node2 x:test_col_shard2_replica2] o.a.s.u.p.LogUpdateProcessorFactory
[test_col_shard2_replica2] webapp=/solr path=/update
params={wt=javabin&version=2}{add=[XYZ!701047101 (1529857595596079104)]} 0 18
{noformat}
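In SolrJ terms the adds look roughly like this (hedged sketch; with the default compositeId router, everything before the {{!}} in the id is hashed to pick the shard, which is exactly the property the test wants to sanity check):
{code:java}
import java.util.Arrays;
import org.apache.solr.common.SolrInputDocument;

// "abc!..." hashed to shard1 and "XYZ!..." to shard2 in this run; any
// two ids sharing a prefix always land on the same shard.
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", "abc!447049285");
doc1.addField("expected_shard_s", "shard1");  // field name per the query below
SolrInputDocument doc2 = new SolrInputDocument();
doc2.addField("id", "XYZ!701047101");
doc2.addField("expected_shard_s", "shard2");
cluster.getSolrClient().add("test_col", Arrays.asList(doc1, doc2));
{code}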
* TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster @ L189
** do a commit using the CloudSolrClient (a one-line SolrJ sketch follows this
section's log excerpt)
** timestamps #1302724 - #1302733 show both replicas of shard1 acting on this
commit & opening new searchers...{noformat}
1302724 INFO (qtp1409631559-7343) [n:127.0.0.1:56361_solr c:test_col s:shard1
r:core_node1 x:test_col_shard1_replica1] o.a.s.u.DirectUpdateHandler2 start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
1302724 INFO (qtp1904355206-7330) [n:127.0.0.1:34275_solr c:test_col s:shard1
r:core_node4 x:test_col_shard1_replica2] o.a.s.u.DirectUpdateHandler2 start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
1302727 INFO (qtp1904355206-7330) [n:127.0.0.1:34275_solr c:test_col s:shard1
r:core_node4 x:test_col_shard1_replica2] o.a.s.c.SolrDeletionPolicy
SolrDeletionPolicy.onCommit: commits: num=2
commit{dir=MockDirectoryWrapper(RAMDirectory@21c6a821
lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@26734241),segFN=segments_1,generation=1}
commit{dir=MockDirectoryWrapper(RAMDirectory@21c6a821
lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@26734241),segFN=segments_2,generation=2}
1302727 INFO (qtp1409631559-7343) [n:127.0.0.1:56361_solr c:test_col s:shard1
r:core_node1 x:test_col_shard1_replica1] o.a.s.c.SolrDeletionPolicy
SolrDeletionPolicy.onCommit: commits: num=2
commit{dir=MockDirectoryWrapper(RAMDirectory@5616141f
lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@6180bc4c),segFN=segments_1,generation=1}
commit{dir=MockDirectoryWrapper(RAMDirectory@5616141f
lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@6180bc4c),segFN=segments_2,generation=2}
1302728 INFO (qtp1904355206-7330) [n:127.0.0.1:34275_solr c:test_col s:shard1
r:core_node4 x:test_col_shard1_replica2] o.a.s.c.SolrDeletionPolicy newest
commit generation = 2
1302729 INFO (qtp1409631559-7343) [n:127.0.0.1:56361_solr c:test_col s:shard1
r:core_node1 x:test_col_shard1_replica1] o.a.s.c.SolrDeletionPolicy newest
commit generation = 2
1302730 INFO (qtp1409631559-7343) [n:127.0.0.1:56361_solr c:test_col s:shard1
r:core_node1 x:test_col_shard1_replica1] o.a.s.s.SolrIndexSearcher Opening
[Searcher@f331408[test_col_shard1_replica1] main]
1302730 INFO (qtp1904355206-7330) [n:127.0.0.1:34275_solr c:test_col s:shard1
r:core_node4 x:test_col_shard1_replica2] o.a.s.s.SolrIndexSearcher Opening
[Searcher@5d1967dd[test_col_shard1_replica2] main]
1302730 INFO (qtp1409631559-7343) [n:127.0.0.1:56361_solr c:test_col s:shard1
r:core_node1 x:test_col_shard1_replica1] o.a.s.u.DirectUpdateHandler2
end_commit_flush
1302730 INFO (qtp1904355206-7330) [n:127.0.0.1:34275_solr c:test_col s:shard1
r:core_node4 x:test_col_shard1_replica2] o.a.s.u.DirectUpdateHandler2
end_commit_flush
1302731 INFO (searcherExecutor-3283-thread-1-processing-n:127.0.0.1:56361_solr
x:test_col_shard1_replica1 s:shard1 c:test_col r:core_node1)
[n:127.0.0.1:56361_solr c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.c.SolrCore [test_col_shard1_replica1]
Registered new searcher Searcher@f331408[test_col_shard1_replica1]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(6.1.0):c1)))}
1302732 INFO (qtp1409631559-7343) [n:127.0.0.1:56361_solr c:test_col s:shard1
r:core_node1 x:test_col_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory
[test_col_shard1_replica1] webapp=/solr path=/update
params={update.distrib=FROMLEADER&waitSearcher=true&openSearcher=true&commit=true&softCommit=false&distrib.from=http://127.0.0.1:56361/solr/test_col_shard1_replica1/&commit_end_point=true&wt=javabin&version=2&expungeDeletes=false}{commit=}
0 7
1302732 INFO (searcherExecutor-3284-thread-1-processing-n:127.0.0.1:34275_solr
x:test_col_shard1_replica2 s:shard1 c:test_col r:core_node4)
[n:127.0.0.1:34275_solr c:test_col s:shard1 r:core_node4
x:test_col_shard1_replica2] o.a.s.c.SolrCore [test_col_shard1_replica2]
Registered new searcher Searcher@5d1967dd[test_col_shard1_replica2]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(6.1.0):c1)))}
1302732 INFO (qtp1904355206-7330) [n:127.0.0.1:34275_solr c:test_col s:shard1
r:core_node4 x:test_col_shard1_replica2] o.a.s.u.p.LogUpdateProcessorFactory
[test_col_shard1_replica2] webapp=/solr path=/update
params={update.distrib=FROMLEADER&waitSearcher=true&openSearcher=true&commit=true&softCommit=false&distrib.from=http://127.0.0.1:56361/solr/test_col_shard1_replica1/&commit_end_point=true&wt=javabin&version=2&expungeDeletes=false}{commit=}
0 8
1302733 INFO (qtp1409631559-7345) [n:127.0.0.1:56361_solr c:test_col s:shard1
r:core_node1 x:test_col_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory
[test_col_shard1_replica1] webapp=/solr path=/update
params={_stateVer_=test_col:7&waitSearcher=true&commit=true&softCommit=false&wt=javabin&version=2}{commit=}
0 11
{noformat}
** {color:red}What the hell happened to shard2?! Why isn't shard2 logging
anything here?!{color}
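The commit itself is one line of SolrJ (hedged sketch, same {{cluster}} as above); the point is that CloudSolrClient sends it to a single node, and that node is expected to fan it out to every replica of every shard -- yet only shard1's two replicas ever log it:
{code:java}
// Hard commit, opening a new searcher; should reach all 4 replicas.
cluster.getSolrClient().commit("test_col",
    /* waitFlush */ true, /* waitSearcher */ true);
{code}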
* TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster @ L190
** do a {{\*:\*}} query using the CloudSolrClient
** goal is to assert that the {{\[shard\]}} value for each of our docs
(prefixes) matches the expectation the rest of the test will have
** timestamp #1302737 shows an error log from a shard1 node indicating that
both replicas of shard2 have just vanished completely (a paraphrased sketch of
the underlying eligibility check follows this section)...{noformat}
1302737 ERROR (qtp1409631559-7347) [n:127.0.0.1:56361_solr c:test_col s:shard1
r:core_node1 x:test_col_shard1_replica1] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException: no servers hosting shard: shard2
at org.apache.solr.handler.component.HttpShardHandler.prepDistributed(HttpShardHandler.java:451)
at org.apache.solr.handler.component.SearchHandler.getAndPrepShardHandler(SearchHandler.java:224)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:262)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2033)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:652)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:229)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:184)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
at org.apache.solr.client.solrj.embedded.JettySolrRunner$DebugFilter.doFilter(JettySolrRunner.java:109)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:224)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:462)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:518)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
at java.lang.Thread.run(Thread.java:745)
1302737 INFO (qtp1409631559-7347) [n:127.0.0.1:56361_solr c:test_col s:shard1
r:core_node1 x:test_col_shard1_replica1] o.a.s.c.S.Request
[test_col_shard1_replica1] webapp=/solr path=/select
params={q=*:*&_stateVer_=test_col:7&fl=id,expected_shard_s,[shard]&wt=javabin&version=2}
status=503 QTime=0
{noformat}
** but 5ms later, timestamp #1302742 shows shard2's core_node2 logging its
successful response to the initial phase of the distributed query (although
with {{hits=0}}, since it never processed the commit or opened a new searcher
after the documents were added)...{noformat}
1302742 INFO (qtp316491708-7357) [n:127.0.0.1:33616_solr c:test_col s:shard2
r:core_node2 x:test_col_shard2_replica2] o.a.s.c.S.Request
[test_col_shard2_replica2] webapp=/solr path=/select
params={distrib=false&_stateVer_=test_col:7&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.0.1:33616/solr/test_col_shard2_replica2/|http://127.0.0.1:51144/solr/test_col_shard2_replica1/&rows=10&version=2&q=*:*&NOW=1458985896717&isShard=true&wt=javabin}
hits=0 status=0 QTime=0
{noformat}
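For anyone not familiar with that error: it comes from HttpShardHandler.prepDistributed (L451 in the stack trace above). Paraphrased, the logic is roughly the following -- a sketch of my reading of the code, NOT the actual Solr source:
{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrException.ErrorCode;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

// A replica is only eligible to serve its shard's slice of the query if
// it is ACTIVE *and* its node is currently in live_nodes; if no replica
// of a shard qualifies, the whole query fails with the 503 seen above.
static List<Replica> eligibleReplicas(Slice slice, ClusterState clusterState) {
  List<Replica> eligible = new ArrayList<>();
  for (Replica r : slice.getReplicas()) {
    if (r.getState() == Replica.State.ACTIVE
        && clusterState.getLiveNodes().contains(r.getNodeName())) {
      eligible.add(r);
    }
  }
  if (eligible.isEmpty()) {
    throw new SolrException(ErrorCode.SERVICE_UNAVAILABLE,
        "no servers hosting shard: " + slice.getName());
  }
  return eligible;
}
{code}
If that reading is right, shard1's node must have momentarily considered both shard2 replicas non-live or non-active, even though core_node2 was demonstrably answering requests 5ms later.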
* The test ultimately fails because the {{\*:\*}} sanity check of numDocs=2
fails (sketch below)
** TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster @ L192
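For reference, the sanity check amounts to something like this (hedged SolrJ sketch; the real assertion lives at L192 of the test):
{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Match-all query asking for the [shard] augmenter, then verify each
// doc's prefix routed it to the shard the rest of the test expects.
SolrQuery q = new SolrQuery("*:*");
q.setFields("id", "expected_shard_s", "[shard]");
QueryResponse rsp = cluster.getSolrClient().query("test_col", q);
assert 2 == rsp.getResults().getNumFound();   // <-- this is what failed
for (SolrDocument doc : rsp.getResults()) {
  assert doc.getFieldValue("expected_shard_s")
            .equals(doc.getFieldValue("[shard]"));
}
{code}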
* other than the {{hits=0}} logging mentioned above (#1302742), there is no
logging from either of the shard2 nodes between timestamp #1302719, when both
replicas accept the doc add, and #1302758, when the test starts shutting down
nodes and closing cores.
> inexplicable "no servers hosting shard: shard2" using MiniSolrCloudCluster
> --------------------------------------------------------------------------
>
> Key: SOLR-8914
> URL: https://issues.apache.org/jira/browse/SOLR-8914
> Project: Solr
> Issue Type: Bug
> Reporter: Hoss Man
> Attachments: jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt
>
>
> Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the
> weekend...
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7
> (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a
> MiniSolrCloudCluster & several clients are initialized -- before any code
> related to TolerantUpdateProcessor is ever used.
> I can't reproduce this, or really make sense of what I'm (not) seeing here in
> the logs, so I'm filing this jira with my analysis in the hopes that someone
> else can help make sense of it.