[
https://issues.apache.org/jira/browse/SOLR-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400199#comment-15400199
]
Hoss Man commented on SOLR-9361:
--------------------------------
Steps to "reproduce" the various confusion/problems...
* Use {{bin/solr -e cloud}} to create a cluster & collection with the following
properties:
** 3 nodes
** accept default port numbers for all 3 nodes (8983, 7574, 8984)
** gettingstarted collection with 1 shard & 3 replicas using default
data_driven_schema_configs
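** for reference, a rough sketch of the interactive session (exact prompt wording may vary by Solr version): {noformat}
$ bin/solr -e cloud

Welcome to the SolrCloud example!
...
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]: 3
Please enter the port for node1 [8983]:
Please enter the port for node2 [7574]:
Please enter the port for node3 [8984]:
...
Please provide a name for your new collection: [gettingstarted]
How many shards would you like to split gettingstarted into? [2]: 1
How many replicas per shard would you like to create? [2]: 3
Please choose a configuration for the gettingstarted collection, available options are:
basic_configs, data_driven_schema_configs, or sample_techproducts_configs [data_driven_schema_configs]:
{noformat}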
* Observe that the Cloud Graph UI should say you have 3 active nodes
** http://localhost:8983/solr/#/~cloud
* Observe that the CLUSTERSTATUS API should also agree that you have 3 live
nodes and that all 3 replicas of your (single) shard report {{state="active"}}
...{noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'
{
"responseHeader":{
"status":0,
"QTime":10},
"cluster":{
"collections":{
"gettingstarted":{
"replicationFactor":"3",
"shards":{"shard1":{
"range":"80000000-7fffffff",
"state":"active",
"replicas":{
"core_node1":{
"core":"gettingstarted_shard1_replica2",
"base_url":"http://127.0.1.1:8983/solr",
"node_name":"127.0.1.1:8983_solr",
"state":"active"},
"core_node2":{
"core":"gettingstarted_shard1_replica1",
"base_url":"http://127.0.1.1:7574/solr",
"node_name":"127.0.1.1:7574_solr",
"state":"active",
"leader":"true"},
"core_node3":{
"core":"gettingstarted_shard1_replica3",
"base_url":"http://127.0.1.1:8984/solr",
"node_name":"127.0.1.1:8984_solr",
"state":"active"}}}},
"router":{"name":"compositeId"},
"maxShardsPerNode":"1",
"autoAddReplicas":"false",
"znodeVersion":8,
"configName":"gettingstarted"}},
"live_nodes":["127.0.1.1:8984_solr",
"127.0.1.1:8983_solr",
"127.0.1.1:7574_solr"]}}
{noformat}
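** (not part of the repro, but a quick way to pull just the interesting bits out of that response from the shell, assuming {{jq}} is installed: {noformat}
$ curl -s 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json' \
    | jq '.cluster.live_nodes'
[
  "127.0.1.1:8984_solr",
  "127.0.1.1:8983_solr",
  "127.0.1.1:7574_solr"
]
{noformat})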
* Now pick a port# that is _not_ 8983 (since that's running embedded ZK) and do
an orderly shutdown: {noformat}
$ bin/solr stop -p 7574
Sending stop command to Solr running on port 7574 ... waiting 5 seconds to
allow Jetty process 4214 to stop gracefully.
{noformat}
* If you reload the Cloud UI screen, you should now see the node you shut down
listed in light-grey -- which according to the key means "Gone" (as opposed to
"Down", which the UI key says should be shown in an orange color)
** http://localhost:8983/solr/#/~cloud
* If you check the CLUSTERSTATUS API again it should now say you have 2 live
nodes and 2 replicas with a {{state="active"}} while 1 replica has a
{{state="down"}} ...{noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'
{
"responseHeader":{
"status":0,
"QTime":1},
"cluster":{
"collections":{
"gettingstarted":{
"replicationFactor":"3",
"shards":{"shard1":{
"range":"80000000-7fffffff",
"state":"active",
"replicas":{
"core_node1":{
"core":"gettingstarted_shard1_replica2",
"base_url":"http://127.0.1.1:8983/solr",
"node_name":"127.0.1.1:8983_solr",
"state":"active",
"leader":"true"},
"core_node2":{
"core":"gettingstarted_shard1_replica1",
"base_url":"http://127.0.1.1:7574/solr",
"node_name":"127.0.1.1:7574_solr",
"state":"down"},
"core_node3":{
"core":"gettingstarted_shard1_replica3",
"base_url":"http://127.0.1.1:8984/solr",
"node_name":"127.0.1.1:8984_solr",
"state":"active"}}}},
"router":{"name":"compositeId"},
"maxShardsPerNode":"1",
"autoAddReplicas":"false",
"znodeVersion":11,
"configName":"gettingstarted"}},
"live_nodes":["127.0.1.1:8984_solr",
"127.0.1.1:8983_solr"]}}
{noformat}
* {color:red}Our first point of confusion for most users: the terminology used
in the Cloud Admin UI screens disagrees with the {{state}} values returned by
the CLUSTERSTATUS API{color}
* Now pick the remaining port# that is _not_ 8983 (since that's still running
embedded ZK) and simulate a "hard crash" of the process and/or
machine:{noformat}
$ cat bin/solr-8984.pid
4386
$ kill -9 4386
{noformat}
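** (to double check that the process is really gone, {{ps}} should no longer find that pid, and {{bin/solr status}} should no longer report a node on port 8984: {noformat}
$ ps -p 4386
  PID TTY          TIME CMD
{noformat})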
* If you reload the Cloud UI screen, you should now see that port 8983 is the
only "Active" node, and both of the nodes we have shut down/killed are listed in
light-grey -- which, as a reminder, according to the key means "Gone" (as
opposed to "Down", which the UI key says should be shown in an orange color)
** http://localhost:8983/solr/#/~cloud
* {color:red}Our second potential point of confusion for users: no distinction
in the Admin UI between a node that has been shut down cleanly (ex: for
maintenance) and a node that unexpectedly vanished from the cluster{color}
* If you check the CLUSTERSTATUS API again it should now say you have 1 live
node and 1 replica with a {{state="active"}} while 2 replicas have a
{{state="down"}} ...{noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'
{
"responseHeader":{
"status":0,
"QTime":2},
"cluster":{
"collections":{
"gettingstarted":{
"replicationFactor":"3",
"shards":{"shard1":{
"range":"80000000-7fffffff",
"state":"active",
"replicas":{
"core_node1":{
"core":"gettingstarted_shard1_replica2",
"base_url":"http://127.0.1.1:8983/solr",
"node_name":"127.0.1.1:8983_solr",
"state":"active",
"leader":"true"},
"core_node2":{
"core":"gettingstarted_shard1_replica1",
"base_url":"http://127.0.1.1:7574/solr",
"node_name":"127.0.1.1:7574_solr",
"state":"down"},
"core_node3":{
"core":"gettingstarted_shard1_replica3",
"base_url":"http://127.0.1.1:8984/solr",
"node_name":"127.0.1.1:8984_solr",
"state":"down"}}}},
"router":{"name":"compositeId"},
"maxShardsPerNode":"1",
"autoAddReplicas":"false",
"znodeVersion":11,
"configName":"gettingstarted"}},
"live_nodes":["127.0.1.1:8983_solr"]}}
{noformat}
* {color:red}Again: potential points of confusion for users:{color}
** {color:red}the terminology used in the Cloud Admin UI screens disagrees with
the {{state}} values returned by the CLUSTERSTATUS API{color}
** {color:red}No distinction in the CLUSTERSTATUS response between a replica
that has been shut down cleanly (ex: for maintenance) vs one that unexpectedly
vanished from the cluster{color}
* Let's assume the user is not concerned about either of the "down" replicas
** example: one of the machines had a hardware failure and is never coming
back. After being alerted to the crash by a monitoring system, they realized
that this cluster was overprovisioned anyway, and a second node was shut down to
repurpose the hardware
* Now the user wants to "clean up" the cluster state and remove these replicas
** but since they've never done this before, they want to be careful not to
accidentally delete the only active replica, so they plan to set
{{onlyIfDown=true}} when issuing their DELETEREPLICA command
* First they issue the DELETEREPLICA command for the replica that was on the
node that was shut down cleanly (7574 / core_node2 in my example above)
...{noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&onlyIfDown=true&collection=gettingstarted&shard=shard1&replica=core_node2&wt=json&indent=true'
{
"responseHeader":{
"status":0,
"QTime":5133},
"failure":{
"127.0.1.1:7574_solr":"org.apache.solr.client.solrj.SolrServerException:Server
refused connection at: http://127.0.1.1:7574/solr"}}
{noformat}
* {color:red}Next point of confusion: why did they get a "Server refused
connection" failure message? Of course you can't connect, the server is down --
that's why the replica is being removed.{color}
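** (you can confirm from the shell that the node really is unreachable -- the DELETEREPLICA failure is essentially the same connection error you'd get by hitting the dead node directly: {noformat}
$ curl 'http://127.0.1.1:7574/solr'
curl: (7) Failed to connect to 127.0.1.1 port 7574: Connection refused
{noformat})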
* Now, in a confused panic that maybe they screwed something up, the user checks
the Cloud Admin UI & CLUSTERSTATUS
** The Admin UI no longer shows the removed replica -- so hopefully the failure
can be ignored?
*** http://localhost:8983/solr/#/~cloud
** CLUSTERSTATUS API also seems "ok" ? ... {noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'{
"responseHeader":{
"status":0,
"QTime":2},
"cluster":{
"collections":{
"gettingstarted":{
"replicationFactor":"3",
"shards":{"shard1":{
"range":"80000000-7fffffff",
"state":"active",
"replicas":{
"core_node1":{
"core":"gettingstarted_shard1_replica2",
"base_url":"http://127.0.1.1:8983/solr",
"node_name":"127.0.1.1:8983_solr",
"state":"active",
"leader":"true"},
"core_node3":{
"core":"gettingstarted_shard1_replica3",
"base_url":"http://127.0.1.1:8984/solr",
"node_name":"127.0.1.1:8984_solr",
"state":"down"}}}},
"router":{"name":"compositeId"},
"maxShardsPerNode":"1",
"autoAddReplicas":"false",
"znodeVersion":12,
"configName":"gettingstarted"}},
"live_nodes":["127.0.1.1:8983_solr"]}}
{noformat}
* Fingers crossed that everything is actually ok, they issue the DELETEREPLICA
command for the replica that was on the node that had a catastrophic failure
(8984 / core_node3 in my example above) ...{noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&onlyIfDown=true&collection=gettingstarted&shard=shard1&replica=core_node3&wt=json&indent=true'
{
"responseHeader":{
"status":400,
"QTime":26},
"Operation deletereplica caused
exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Attempted to remove replica : gettingstarted/shard1/core_node3 with
onlyIfDown='true', but state is 'active'",
"exception":{
"msg":"Attempted to remove replica : gettingstarted/shard1/core_node3 with
onlyIfDown='true', but state is 'active'",
"rspCode":400},
"error":{
"metadata":[
"error-class","org.apache.solr.common.SolrException",
"root-error-class","org.apache.solr.common.SolrException"],
"msg":"Attempted to remove replica : gettingstarted/shard1/core_node3 with
onlyIfDown='true', but state is 'active'",
"code":400}}
{noformat}
* {color:red}Now the user is completely baffled{color}
** {color:red}Why is Solr complaining that {{gettingstarted/shard1/core_node3}}
can't be removed with {{onlyIfDown='true'}} because {{state is 'active'}}
???{color}
** {color:red}Neither the UI nor the CLUSTERSTATUS API said the replica was up
-- CLUSTERSTATUS explicitly said it was DOWN!{color}
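** (my best guess at what is going on -- not verified: DELETEREPLICA appears to consult the raw replica state recorded in ZK, which was never updated to "down" because the node crashed, while the UI & CLUSTERSTATUS also factor in {{live_nodes}}. Assuming the embedded ZK started by {{bin/solr -e cloud}} is listening on port 9983, the raw state can be inspected with: {noformat}
$ server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 \
    -cmd get /collections/gettingstarted/state.json
{noformat} ...where core_node3 presumably still shows {{"state":"active"}})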
* Frustrated, the user tries again -- this time with {{onlyIfDown=false}},
assuming that's the best option given the error message they
received...{noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&onlyIfDown=false&collection=gettingstarted&shard=shard1&replica=core_node3&wt=json&indent=true'
{
"responseHeader":{
"status":0,
"QTime":5131},
"failure":{
"127.0.1.1:8984_solr":"org.apache.solr.client.solrj.SolrServerException:Server
refused connection at: http://127.0.1.1:8984/solr"}}
{noformat}
* {color:red}Another confusing "Server refused connection" failure message --
but at least now the Admin UI & CLUSTERSTATUS API agree that they don't know
anything about either replica we wanted to remove...{color}
** http://localhost:8983/solr/#/~cloud
** {noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'{
"responseHeader":{
"status":0,
"QTime":2},
"cluster":{
"collections":{
"gettingstarted":{
"replicationFactor":"3",
"shards":{"shard1":{
"range":"80000000-7fffffff",
"state":"active",
"replicas":{
"core_node1":{
"core":"gettingstarted_shard1_replica2",
"base_url":"http://127.0.1.1:8983/solr",
"node_name":"127.0.1.1:8983_solr",
"state":"active",
"leader":"true"},
"core_node3":{
"core":"gettingstarted_shard1_replica3",
"base_url":"http://127.0.1.1:8984/solr",
"node_name":"127.0.1.1:8984_solr",
"state":"down"}}}},
"router":{"name":"compositeId"},
"maxShardsPerNode":"1",
"autoAddReplicas":"false",
"znodeVersion":12,
"configName":"gettingstarted"}},
"live_nodes":["127.0.1.1:8983_solr"]}}
{noformat}
> Concept of replica state being "down" is confusing and misleading
> (especially w/DELETEREPLICA)
> -----------------------------------------------------------------------------------------------
>
> Key: SOLR-9361
> URL: https://issues.apache.org/jira/browse/SOLR-9361
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
>
> In this thread on solr-user, Jerome Yang pointed out some really confusing
> behavior regarding a "down" node and DELETEREPLICA's behavior when a node is
> not shutdown cleanly...
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3CCA+8Dz=26QuB5qNogG_GNXUU7Ru2JQQ94oH5qJvfztPvn+h=2...@mail.gmail.com%3E
> I'll post a comment in a moment with a detailed walk through of how
> confusing the "state" of a node/replica can be when a machine crashes, but
> the summary highlights are...
> * Admin UI & CLUSTERSTATUS API use different terminology to describe replicas
> hosted on machines that can't be reached
> ** CLUSTERSTATUS API lists the status as "down"
> ** the Admin UI displays them as "Gone" (even though it also has an option
> for "Down" which never seems to be used)
> * Neither the Admin UI nor the CLUSTERSTATUS API distinguishes replicas on
> nodes that were shut down cleanly vs replicas on nodes that just vanished from
> the cluster (ie: catastrophic failure / network partitioning)
> * DELETEREPLICA w/ {{onlyIfDown=true}} only works if a replica was shut down
> cleanly
> ** For a replica that was on a node that had a catastrophic failure, using
> {{onlyIfDown=true}} causes an error that the replica {{state is 'active'}}
> *** This in spite of the fact that CLUSTERSTATUS API explicitly says
> {{"state":"down"}} for that replica
> * DELETEREPLICA on any replica that was hosted on a node that is no longer up
> (either because it was shut down cleanly and using {{onlyIfDown=true}}, or
> down for any reason and using {{onlyIfDown=false}}) generates a
> "{{Server refused connection}}" failure
> ** This in spite of the fact that the DELETEREPLICA otherwise appears to have
> succeeded
> ...there are probably multiple underlying bugs here that are exponentially
> worse in the context of each other. We should spin off new issues as needed
> to track them once they are concretely identified, but I wanted to open this
> "uber issue" to capture the overall experience.