[jira] [Commented] (SOLR-8914) inexplicable "no servers hosting shard: shard2" using MiniSolrCloudCluster

Hoss Man (JIRA) Mon, 28 Mar 2016 18:53:53 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215237#comment-15215237
 ]


Hoss Man commented on SOLR-8914:
--------------------------------

Nice catch Scott!

FWIW, here's a one liner that pulls out all the "live nodes size" messages from 
the log and distills it down to just the timeestamp, threadIds (showing the 
port# of the jetty instance) and the current live node count...

{noformat}
grep "live nodes size" jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt | 
perl -ple 's/.*2> (\d+).*?\((.*?)\).*?o.a.s.c.c.ZkStateReader.*?live nodes 
size: (\[\d+\]).*/$1 $2 $3/'
{noformat}

using the output from that command as a refrence point, we can revist some of 
my earlier observations...

* NOTE: there are 5 nodes in this cluster, but only 2 shards x 2 replicas
* L122: collection creation
** when the 4 cores for the collection are created, the jetty node that does 
_NOT_ recieve any core is port #63099...{noformat}
1291030 INFO  
(OverseerThreadFactory-3272-thread-1-processing-n:127.0.0.1:34275_solr) 
[n:127.0.0.1:34275_solr    ] o.a.s.c.OverseerCollectionMessageHandler Creating 
core test_col_shard1_replica1 as part of shard shard1 of collection test_col on 
127.0.0.1:56361_solr
1291031 INFO  
(OverseerThreadFactory-3272-thread-1-processing-n:127.0.0.1:34275_solr) 
[n:127.0.0.1:34275_solr    ] o.a.s.c.OverseerCollectionMessageHandler Creating 
core test_col_shard2_replica2 as part of shard shard2 of collection test_col on 
127.0.0.1:33616_solr
1291038 INFO  
(OverseerThreadFactory-3272-thread-1-processing-n:127.0.0.1:34275_solr) 
[n:127.0.0.1:34275_solr    ] o.a.s.c.OverseerCollectionMessageHandler Creating 
core test_col_shard2_replica1 as part of shard shard2 of collection test_col on 
127.0.0.1:51144_solr
1291041 INFO  
(OverseerThreadFactory-3272-thread-1-processing-n:127.0.0.1:34275_solr) 
[n:127.0.0.1:34275_solr    ] o.a.s.c.OverseerCollectionMessageHandler Creating 
core test_col_shard1_replica2 as part of shard shard1 of collection test_col on 
127.0.0.1:34275_solr
{noformat}
* L129: call to AbstractDistribZkTestBase.waitForRecoveriesToFinish 
** this happens between timestamp #1302667 - #1302668
** the most recent flurry of log messages mentioning "live nodes size" prior to 
this timestamp window are...{noformat}
1301787 zkCallback-1136-thread-1-processing-n:127.0.0.1:34275_solr [5]
1301788 zkCallback-1142-thread-2-processing-n:127.0.0.1:51144_solr [5]
1301791 zkCallback-1143-thread-1-processing-n:127.0.0.1:33616_solr [5]
1301791 zkCallback-1135-thread-2-processing-n:127.0.0.1:56361_solr [3]
{noformat}
** NOTE: these log messages are watches on 
{{path:/collections/test_col/state.json}} which port#63099 has no reason to 
watch (since it's not hosting any shards of {{test_col}} so it's not unusual 
that it's not logging anything around these times
** {color:red}why does port #56361 think at this point that there are only 3 
live nodes?{color}
* L184 - L188: adding docs
** this is done with CloudSolrClient, so each doc goes to it's respective 
leader regardless of wether any 1 node has a broken view of live_nodes
** port #56361 hosts test_col_shard1_replica1 which is aparently the leader of 
shard1 at this moment
** the shard1 doc is forwarded correctly to test_col_shard1_replica2 (on port 
#34275)
*** so #34275 is evidently in port #56361's "view" of live_nodes, and _not_ one 
of the "missing" live_nodes
* L189: commit
** happens during timestamps #1302724 - #1302733
** (no new "live nodes size" log messages prior to this time window since the 
ones mentioned previously)
** IIUC CloudSolrClient correctly, this is considered a "non routable" update, 
so it will be sent to a (leader) node picked at random
** from the logs, it looks like CloudSolrClient picked port #56361
** port #56361's test_col_shard1_replica1 forwarded to it's replica 
test_col_shard1_replica2 on port #34275
** as mentioned before shard2 never got the update - aparently because port 
#56361 didn't think that shard2's leader was in live_nodes
* L190: the {{\*:\*} query
** timestamp #1302737 is when port #56361 (who we know has a bogus view of 
live_nodes) logs that no servesr are hosting shard2, and returns a 503 error 
code
** (there are still no new "live nodes size" log messages prior to this error 
since the ones mentioned previously)
** Something that didn't occur to me before, is that IIRC 
CloudSolrServertimestamp will retry some error codes -- i'm guessing 503 is one 
of them, because that would explain the subsequent log messages 5ms later...
** timestamps #1302742 - 1302746, it appears that port #34275 was queried by 
CloudSolrClient on retry, and it did know that both replicas of shard2 were 
alive and queried them (but since they never got the commit, they had 
numFound=0){noformat}
1302742 INFO  (qtp316491708-7357) [n:127.0.0.1:33616_solr c:test_col s:shard2 
r:core_node2 x:test_col_shard2_replica2] o.a.s.c.S.Request 
[test_col_shard2_replica2]  webapp=/solr path=/select 
params={distrib=false&_stateVer_=test_col:7&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.0.1:33616/solr/test_col_shard2_replica2/|http://127.0.0.1:51144/solr/test_col_shard2_replica1/&rows=10&version=2&q=*:*&NOW=1458985896717&isShard=true&wt=javabin}
 hits=0 status=0 QTime=0
1302742 INFO  (qtp1409631559-7346) [n:127.0.0.1:56361_solr c:test_col s:shard1 
r:core_node1 x:test_col_shard1_replica1] o.a.s.c.S.Request 
[test_col_shard1_replica1]  webapp=/solr path=/select 
params={distrib=false&_stateVer_=test_col:7&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.0.1:56361/solr/test_col_shard1_replica1/|http://127.0.0.1:34275/solr/test_col_shard1_replica2/&rows=10&version=2&q=*:*&NOW=1458985896717&isShard=true&wt=javabin}
 hits=1 status=0 QTime=0
1302745 INFO  (qtp1409631559-7344) [n:127.0.0.1:56361_solr c:test_col s:shard1 
r:core_node1 x:test_col_shard1_replica1] o.a.s.c.S.Request 
[test_col_shard1_replica1]  webapp=/solr path=/select 
params={q=*:*&distrib=false&_stateVer_=test_col:7&fl=id,expected_shard_s,[shard]&shards.purpose=64&NOW=1458985896717&ids=abc!447049285&isShard=true&shard.url=http://127.0.0.1:56361/solr/test_col_shard1_replica1/|http://127.0.0.1:34275/solr/test_col_shard1_replica2/&wt=javabin&version=2}
 status=0 QTime=0
1302746 INFO  (qtp1904355206-7332) [n:127.0.0.1:34275_solr c:test_col s:shard1 
r:core_node4 x:test_col_shard1_replica2] o.a.s.c.S.Request 
[test_col_shard1_replica2]  webapp=/solr path=/select 
params={q=*:*&_stateVer_=test_col:7&fl=id,expected_shard_s,[shard]&wt=javabin&version=2}
 hits=1 status=0 QTime=7
{noformat}


So to sum up: it appears everything i noted before would be easily explained by 
a some broken code regarding live nodes that only affected port #56361 while 
all other nodes had a perfecly accurate view of the world.


> inexplicable "no servers hosting shard: shard2" using MiniSolrCloudCluster
> --------------------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt
>
>
> Jenkin's encountered a failure in TestTolerantUpdateProcessorCloud over the 
> weekend....
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 
> (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a 
> MiniSolrCloudCluster & several clients are initialized -- before any code 
> related to TolerantUpdateProcessor is ever used.
> I can't reproduce this, or really make sense of what i'm (not) seeing here in 
> the logs, so i'm filing this jira with my analysis in the hopes that someone 
> else can help make sense of it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-8914) inexplicable "no servers hosting shard: shard2" using MiniSolrCloudCluster

Reply via email to