[
https://issues.apache.org/jira/browse/SOLR-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400199#comment-15400199
]
Hoss Man commented on SOLR-9361:
--------------------------------
Steps to "reproduce" the various confusion/problems...
* Use {{bin/solr -e cloud}} to create a cluster & collection with the following
properties:
** 3 nodes
** accept default port numbers for all 3 nodes (8983, 7574, 8984)
** gettingstarted collection with 1 shard & 3 replicas using default
data_driven_schema_configs
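** for reference, a rough sketch of the interactive session (exact prompt wording may vary by Solr version): {noformat}
$ bin/solr -e cloud

Welcome to the SolrCloud example!
...
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]: 3
Please enter the port for node1 [8983]:
Please enter the port for node2 [7574]:
Please enter the port for node3 [8984]:
...
Please provide a name for your new collection: [gettingstarted]
How many shards would you like to split gettingstarted into? [2]: 1
How many replicas per shard would you like to create? [2]: 3
Please choose a configuration for the gettingstarted collection, available options are:
basic_configs, data_driven_schema_configs, or sample_techproducts_configs [data_driven_schema_configs]:
{noformat}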
* Observe that the Cloud Graph UI should say you have 3 active nodes
** http://localhost:8983/solr/#/~cloud
* Observe that the CLUSTERSTATUS API should also agree that you have 3 live
nodes and that all 3 replicas of your (single) shard report {{state="active"}}
...{noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'
{
"responseHeader":{
"status":0,
"QTime":10},
"cluster":{
"collections":{
"gettingstarted":{
"replicationFactor":"3",
"shards":{"shard1":{
"range":"80000000-7fffffff",
"state":"active",
"replicas":{
"core_node1":{
"core":"gettingstarted_shard1_replica2",
"base_url":"http://127.0.1.1:8983/solr",
"node_name":"127.0.1.1:8983_solr",
"state":"active"},
"core_node2":{
"core":"gettingstarted_shard1_replica1",
"base_url":"http://127.0.1.1:7574/solr",
"node_name":"127.0.1.1:7574_solr",
"state":"active",
"leader":"true"},
"core_node3":{
"core":"gettingstarted_shard1_replica3",
"base_url":"http://127.0.1.1:8984/solr",
"node_name":"127.0.1.1:8984_solr",
"state":"active"}}}},
"router":{"name":"compositeId"},
"maxShardsPerNode":"1",
"autoAddReplicas":"false",
"znodeVersion":8,
"configName":"gettingstarted"}},
"live_nodes":["127.0.1.1:8984_solr",
"127.0.1.1:8983_solr",
"127.0.1.1:7574_solr"]}}
{noformat}
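** (not part of the repro, but a quick way to pull just the interesting bits out of that response from the shell, assuming {{jq}} is installed: {noformat}
$ curl -s 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json' \
    | jq '.cluster.live_nodes'
[
  "127.0.1.1:8984_solr",
  "127.0.1.1:8983_solr",
  "127.0.1.1:7574_solr"
]
{noformat})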
* Now pick a port# that is _not_ 8983 (since that's running embedded ZK) and do
an orderly shutdown: {noformat}
$ bin/solr stop -p 7574
Sending stop command to Solr running on port 7574 ... waiting 5 seconds to
allow Jetty process 4214 to stop gracefully.
{noformat}
* If you reload the Cloud UI screen, you should now see the node you shut down
listed in light-grey -- which according to the key means "Gone" (as opposed to
"Down", which the UI key says should be shown in an orange color)
** http://localhost:8983/solr/#/~cloud
* If you check the CLUSTERSTATUS API again it should now say you have 2 live
nodes and 2 replicas with a {{state="active"}} while 1 replica has a
{{state="down"}} ...{noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'
{
"responseHeader":{
"status":0,
"QTime":1},
"cluster":{
"collections":{
"gettingstarted":{
"replicationFactor":"3",
"shards":{"shard1":{
"range":"80000000-7fffffff",
"state":"active",
"replicas":{
"core_node1":{
"core":"gettingstarted_shard1_replica2",
"base_url":"http://127.0.1.1:8983/solr",
"node_name":"127.0.1.1:8983_solr",
"state":"active",
"leader":"true"},
"core_node2":{
"core":"gettingstarted_shard1_replica1",
"base_url":"http://127.0.1.1:7574/solr",
"node_name":"127.0.1.1:7574_solr",
"state":"down"},
"core_node3":{
"core":"gettingstarted_shard1_replica3",
"base_url":"http://127.0.1.1:8984/solr",
"node_name":"127.0.1.1:8984_solr",
"state":"active"}}}},
"router":{"name":"compositeId"},
"maxShardsPerNode":"1",
"autoAddReplicas":"false",
"znodeVersion":11,
"configName":"gettingstarted"}},
"live_nodes":["127.0.1.1:8984_solr",
"127.0.1.1:8983_solr"]}}
{noformat}
* {color:red}Our first point of confusion for most users: the terminology used
in the Cloud Admin UI screens disagrees with the {{state}} values returned by
the CLUSTERSTATUS API{color}
* Now pick the remaining port# that is _not_ 8983 (since that's still running
embedded ZK) and simulate a "hard crash" of the process and/or
machine:{noformat}
$ cat bin/solr-8984.pid
4386
$ kill -9 4386
{noformat}
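** (to double check that the process is really gone, {{ps}} should no longer find that pid, and {{bin/solr status}} should no longer report a node on port 8984: {noformat}
$ ps -p 4386
  PID TTY          TIME CMD
{noformat})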
* If you reload the Cloud UI screen, you should now see that port 8983 is the
only "Active" node, and both of the nodes we have shut down/killed are listed in
light-grey -- which, as a reminder, according to the key means "Gone" (as
opposed to "Down", which the UI key says should be shown in an orange color)
** http://localhost:8983/solr/#/~cloud
* {color:red}Our second potential point of confusion for users: no distinction
in the Admin UI between a node that has been shut down cleanly (ex: for
maintenance) and a node that unexpectedly vanished from the cluster{color}
* If you check the CLUSTERSTATUS API again it should now say you have 1 live
node and 1 replica with a {{state="active"}} while 2 replicas have a
{{state="down"}} ...{noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'
{
"responseHeader":{
"status":0,
"QTime":2},
"cluster":{
"collections":{
"gettingstarted":{
"replicationFactor":"3",
"shards":{"shard1":{
"range":"80000000-7fffffff",
"state":"active",
"replicas":{
"core_node1":{
"core":"gettingstarted_shard1_replica2",
"base_url":"http://127.0.1.1:8983/solr",
"node_name":"127.0.1.1:8983_solr",
"state":"active",
"leader":"true"},
"core_node2":{
"core":"gettingstarted_shard1_replica1",
"base_url":"http://127.0.1.1:7574/solr",
"node_name":"127.0.1.1:7574_solr",
"state":"down"},
"core_node3":{
"core":"gettingstarted_shard1_replica3",
"base_url":"http://127.0.1.1:8984/solr",
"node_name":"127.0.1.1:8984_solr",
"state":"down"}}}},
"router":{"name":"compositeId"},
"maxShardsPerNode":"1",
"autoAddReplicas":"false",
"znodeVersion":11,
"configName":"gettingstarted"}},
"live_nodes":["127.0.1.1:8983_solr"]}}
{noformat}
* {color:red}Again: potential points of confusion for users:{color}
** {color:red}the terminology used in the Cloud Admin UI screens disagrees with
the {{state}} values returned by the CLUSTERSTATUS API{color}
** {color:red}No distinction in the CLUSTERSTATUS response between a replica
that has been shut down cleanly (ex: for maintenance) vs one that unexpectedly
vanished from the cluster{color}
* Let's assume the user is not concerned about either of the "down" replicas
** example: one of the machines had a hardware failure and is never coming
back. After being alerted to the crash by a monitoring system, they realized
that this cluster was overprovisioned anyway, and a second node was shut down to
repurpose the hardware
* Now the user wants to "clean up" the cluster state and remove these replicas
** but since they've never done this before, they want to be careful not to
accidentally delete the only active replica, so they plan to set
{{onlyIfDown=true}} when issuing their DELETEREPLICA command
* First they issue the DELETEREPLICA command for the replica that was on the
node that was shut down cleanly (7574 / core_node2 in my example above)
...{noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&onlyIfDown=true&collection=gettingstarted&shard=shard1&replica=core_node2&wt=json&indent=true'
{
"responseHeader":{
"status":0,
"QTime":5133},
"failure":{
"127.0.1.1:7574_solr":"org.apache.solr.client.solrj.SolrServerException:Server
refused connection at: http://127.0.1.1:7574/solr"}}
{noformat}
* {color:red}Next point of confusion: why did they get a "Server refused
connection" failure message? Of course you can't connect, the server is down --
that's why the replica is being removed.{color}
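** (you can confirm from the shell that the node really is unreachable -- the DELETEREPLICA failure is essentially the same connection error you'd get by hitting the dead node directly: {noformat}
$ curl 'http://127.0.1.1:7574/solr'
curl: (7) Failed to connect to 127.0.1.1 port 7574: Connection refused
{noformat})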
* Now, in a confused panic that maybe they screwed something up, the user checks
the Cloud Admin UI & CLUSTERSTATUS
** The Admin UI no longer shows the removed replica -- so hopefully the failure
can be ignored?
*** http://localhost:8983/solr/#/~cloud
** CLUSTERSTATUS API also seems "ok" ? ... {noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'{
"responseHeader":{
"status":0,
"QTime":2},
"cluster":{
"collections":{
"gettingstarted":{
"replicationFactor":"3",
"shards":{"shard1":{
"range":"80000000-7fffffff",
"state":"active",
"replicas":{
"core_node1":{
"core":"gettingstarted_shard1_replica2",
"base_url":"http://127.0.1.1:8983/solr",
"node_name":"127.0.1.1:8983_solr",
"state":"active",
"leader":"true"},
"core_node3":{
"core":"gettingstarted_shard1_replica3",
"base_url":"http://127.0.1.1:8984/solr",
"node_name":"127.0.1.1:8984_solr",
"state":"down"}}}},
"router":{"name":"compositeId"},
"maxShardsPerNode":"1",
"autoAddReplicas":"false",
"znodeVersion":12,
"configName":"gettingstarted"}},
"live_nodes":["127.0.1.1:8983_solr"]}}
{noformat}
* Fingers crossed that everything is actually ok, they issue the DELETEREPLICA
command for the replica that was on the node that had a catastrophic failure
(8984 / core_node3 in my example above) ...{noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&onlyIfDown=true&collection=gettingstarted&shard=shard1&replica=core_node3&wt=json&indent=true'
{
"responseHeader":{
"status":400,
"QTime":26},
"Operation deletereplica caused
exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Attempted to remove replica : gettingstarted/shard1/core_node3 with
onlyIfDown='true', but state is 'active'",
"exception":{
"msg":"Attempted to remove replica : gettingstarted/shard1/core_node3 with
onlyIfDown='true', but state is 'active'",
"rspCode":400},
"error":{
"metadata":[
"error-class","org.apache.solr.common.SolrException",
"root-error-class","org.apache.solr.common.SolrException"],
"msg":"Attempted to remove replica : gettingstarted/shard1/core_node3 with
onlyIfDown='true', but state is 'active'",
"code":400}}
{noformat}
* {color:red}Now the user is completely baffled{color}
** {color:red}Why is Solr complaining that {{gettingstarted/shard1/core_node3}}
can't be removed with {{onlyIfDown='true'}} because {{state is 'active'}}
???{color}
** {color:red}Neither the UI nor the CLUSTERSTATUS API said the replica was up
-- CLUSTERSTATUS explicitly said it was DOWN!{color}
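** (my best guess at what is going on -- not verified: DELETEREPLICA appears to consult the raw replica state recorded in ZK, which was never updated to "down" because the node crashed, while the UI & CLUSTERSTATUS also factor in {{live_nodes}}. Assuming the embedded ZK started by {{bin/solr -e cloud}} is listening on port 9983, the raw state can be inspected with: {noformat}
$ server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 \
    -cmd get /collections/gettingstarted/state.json
{noformat} ...where core_node3 presumably still shows {{"state":"active"}})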
* Frustrated, the user tries again -- this time with {{onlyIfDown=false}},
assuming that's the best option given the error message they
received...{noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&onlyIfDown=false&collection=gettingstarted&shard=shard1&replica=core_node3&wt=json&indent=true'
{
"responseHeader":{
"status":0,
"QTime":5131},
"failure":{
"127.0.1.1:8984_solr":"org.apache.solr.client.solrj.SolrServerException:Server
refused connection at: http://127.0.1.1:8984/solr"}}
{noformat}
* {color:red}Another confusing "Server refused connection" failure message --
but at least now the Admin UI & CLUSTERSTATUS API agree that they don't know
anything about either replica we wanted to remove...{color}
** http://localhost:8983/solr/#/~cloud
** {noformat}
$ curl
'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'{
"responseHeader":{
"status":0,
"QTime":2},
"cluster":{
"collections":{
"gettingstarted":{
"replicationFactor":"3",
"shards":{"shard1":{
"range":"80000000-7fffffff",
"state":"active",
"replicas":{
"core_node1":{
"core":"gettingstarted_shard1_replica2",
"base_url":"http://127.0.1.1:8983/solr",
"node_name":"127.0.1.1:8983_solr",
"state":"active",
"leader":"true"},
"core_node3":{
"core":"gettingstarted_shard1_replica3",
"base_url":"http://127.0.1.1:8984/solr",
"node_name":"127.0.1.1:8984_solr",
"state":"down"}}}},
"router":{"name":"compositeId"},
"maxShardsPerNode":"1",
"autoAddReplicas":"false",
"znodeVersion":12,
"configName":"gettingstarted"}},
"live_nodes":["127.0.1.1:8983_solr"]}}
{noformat}
> Concept of replica state being "down" is confusing and misleading
> (especially w/DELETEREPLICA)
> -----------------------------------------------------------------------------------------------
>
> Key: SOLR-9361
> URL: https://issues.apache.org/jira/browse/SOLR-9361
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
>
> In this thread on solr-user, Jerome Yang pointed out some really confusing
> behavior regarding a "down" node and DELETEREPLICA's behavior when a node is
> not shutdown cleanly...
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3CCA+8Dz=26QuB5qNogG_GNXUU7Ru2JQQ94oH5qJvfztPvn+h=2...@mail.gmail.com%3E
> I'll post a comment in a moment with a detailed walk through of how
> confusing the "state" of a node/replica can be when a machine crashes, but
> the summary highlights are...
> * Admin UI & CLUSTERSTATUS API use different terminology to describe replicas
> hosted on machines that can't be reached
> ** CLUSTERSTATUS API lists the status as "down"
> ** the Admin UI displays them as "Gone" (even though it also has an option
> for "Down" which never seems to be used)
> * Neither the Admin UI nor the CLUSTERSTATUS API distinguishes replicas on
> nodes that were shut down cleanly vs replicas on nodes that just vanished from
> the cluster (ie: catastrophic failure / network partitioning)
> * DELETEREPLICA w/ {{onlyIfDown=true}} only works if a replica was shut down
> cleanly
> ** For a replica that was on a node that had a catastrophic failure, using
> {{onlyIfDown=true}} causes an error that the replica {{state is 'active'}}
> *** This in spite of the fact that CLUSTERSTATUS API explicitly says
> {{"state":"down"}} for that replica
> * DELETEREPLICA on any replica that was hosted on a node that is no longer up
> (either because it was shut down cleanly and using {{onlyIfDown=true}}, or
> down for any reason and using {{onlyIfDown=false}}) generates a
> "{{Server refused connection}}" failure
> ** This in spite of the fact that the DELETEREPLICA otherwise appears to have
> succeeded
> ...there are probably multiple underlying bugs here that are exponentially
> worse in the context of each other. We should spin off new issues as needed
> to track them once they are concretely identified, but I wanted to open this
> "uber issue" to capture the overall experience.