[ 
https://issues.apache.org/jira/browse/SOLR-12291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458279#comment-16458279
 ] 

Varun Thacker edited comment on SOLR-12291 at 4/30/18 2:46 AM:
---------------------------------------------------------------

Here's a simple way to reproduce this:

 

Start Solr with
{code:java}
bin/solr start -e gettingstarted -noprompt{code}
Call collection reload with async
{code:java}
http://localhost:8983/solr/admin/collections?action=reload&name=gettingstarted&async=1{code}
Call request status
{code:java}
http://localhost:8983/solr/admin/collections?action=requeststatus&requestid=1{code}
Here's the output ( it's a miracle our users are able to make some sense of 
this response ;) ) 

 
{code:java}
{
  "responseHeader":{
    "status":0,
    "QTime":5},
  "success":{
    "192.168.0.6:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":1}},
    "192.168.0.6:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0}},
    "192.168.0.6:7574_solr":{
      "responseHeader":{
        "status":0,
        "QTime":58}},
    "192.168.0.6:7574_solr":{
      "responseHeader":{
        "status":0,
        "QTime":59}}},
  "167115373362154":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 167115373362154 webapp=null path=/admin/cores 
params={core=gettingstarted_shard2_replica_n4&async=167115373362154&qt=/admin/cores&action=RELOAD&wt=javabin&version=2}
 status=0 QTime=58"},
  "167115373632133":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 167115373632133 webapp=null path=/admin/cores 
params={core=gettingstarted_shard2_replica_n6&async=167115373632133&qt=/admin/cores&action=RELOAD&wt=javabin&version=2}
 status=0 QTime=0"},
  "status":{
    "state":"completed",
    "msg":"found [1] in completed tasks"}}{code}
Notice how we only track 2 task id's ..It should be 4 because the 
gettingstarted example creates a 2X2 collection.

 

Also if you check the logs there were 4 tasks but the overseer didn't track 2 
of them 
{code:java}
~/solr-7.3.0$ grep "/admin/cores" example/cloud/node1/logs/solr.log | grep -i 
reload

INFO  - 2018-04-30 02:40:05.614; [   ] org.apache.solr.servlet.HttpSolrCall; 
[admin] webapp=null path=/admin/cores 
params={core=gettingstarted_shard1_replica_n2&async=167115373148039&qt=/admin/cores&action=RELOAD&wt=javabin&version=2}
 status=0 QTime=1

INFO  - 2018-04-30 02:40:05.614; [   ] org.apache.solr.servlet.HttpSolrCall; 
[admin] webapp=null path=/admin/cores 
params={core=gettingstarted_shard2_replica_n6&async=167115373632133&qt=/admin/cores&action=RELOAD&wt=javabin&version=2}
 status=0 QTime=0

~/solr-7.3.0$ grep "/admin/cores" example/cloud/node2/logs/solr.log | grep -i 
reload

INFO  - 2018-04-30 02:40:05.672; [   ] org.apache.solr.servlet.HttpSolrCall; 
[admin] webapp=null path=/admin/cores 
params={core=gettingstarted_shard1_replica_n1&async=167115372921360&qt=/admin/cores&action=RELOAD&wt=javabin&version=2}
 status=0 QTime=59

INFO  - 2018-04-30 02:40:05.672; [   ] org.apache.solr.servlet.HttpSolrCall; 
[admin] webapp=null path=/admin/cores 
params={core=gettingstarted_shard2_replica_n4&async=167115373362154&qt=/admin/cores&action=RELOAD&wt=javabin&version=2}
 status=0 QTime=58{code}


was (Author: varunthacker):
Here's a simple way to reproduce this:

 

Start Solr with
{code:java}
bin/solr start -e gettingstarted -noprompt{code}
Call collection reload with async
{code:java}
http://localhost:8983/solr/admin/collections?action=reload&name=gettingstarted&async=1{code}
Call request status
{code:java}
http://localhost:8983/solr/admin/collections?action=requeststatus&requestid=1{code}
Here's the output

 
{code:java}
{
  "responseHeader":{
    "status":0,
    "QTime":5},
  "success":{
    "192.168.0.6:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":1}},
    "192.168.0.6:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0}},
    "192.168.0.6:7574_solr":{
      "responseHeader":{
        "status":0,
        "QTime":58}},
    "192.168.0.6:7574_solr":{
      "responseHeader":{
        "status":0,
        "QTime":59}}},
  "167115373362154":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 167115373362154 webapp=null path=/admin/cores 
params={core=gettingstarted_shard2_replica_n4&async=167115373362154&qt=/admin/cores&action=RELOAD&wt=javabin&version=2}
 status=0 QTime=58"},
  "167115373632133":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 167115373632133 webapp=null path=/admin/cores 
params={core=gettingstarted_shard2_replica_n6&async=167115373632133&qt=/admin/cores&action=RELOAD&wt=javabin&version=2}
 status=0 QTime=0"},
  "status":{
    "state":"completed",
    "msg":"found [1] in completed tasks"}}{code}
Notice how we only track 2 task id's ..It should be 4 because the 
gettingstarted example creates a 2X2 collection.

 

Also if you check the logs there were 4 tasks but the overseer didn't track 2 
of them 
{code:java}
~/solr-7.3.0$ grep "/admin/cores" example/cloud/node1/logs/solr.log | grep -i 
reload

INFO  - 2018-04-30 02:40:05.614; [   ] org.apache.solr.servlet.HttpSolrCall; 
[admin] webapp=null path=/admin/cores 
params={core=gettingstarted_shard1_replica_n2&async=167115373148039&qt=/admin/cores&action=RELOAD&wt=javabin&version=2}
 status=0 QTime=1

INFO  - 2018-04-30 02:40:05.614; [   ] org.apache.solr.servlet.HttpSolrCall; 
[admin] webapp=null path=/admin/cores 
params={core=gettingstarted_shard2_replica_n6&async=167115373632133&qt=/admin/cores&action=RELOAD&wt=javabin&version=2}
 status=0 QTime=0

~/solr-7.3.0$ grep "/admin/cores" example/cloud/node2/logs/solr.log | grep -i 
reload

INFO  - 2018-04-30 02:40:05.672; [   ] org.apache.solr.servlet.HttpSolrCall; 
[admin] webapp=null path=/admin/cores 
params={core=gettingstarted_shard1_replica_n1&async=167115372921360&qt=/admin/cores&action=RELOAD&wt=javabin&version=2}
 status=0 QTime=59

INFO  - 2018-04-30 02:40:05.672; [   ] org.apache.solr.servlet.HttpSolrCall; 
[admin] webapp=null path=/admin/cores 
params={core=gettingstarted_shard2_replica_n4&async=167115373362154&qt=/admin/cores&action=RELOAD&wt=javabin&version=2}
 status=0 QTime=58{code}

> OverseerCollectionMessageHandler sliceCmd assumes only one replica exists on 
> each node
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-12291
>                 URL: https://issues.apache.org/jira/browse/SOLR-12291
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Backup/Restore, SolrCloud
>            Reporter: Varun Thacker
>            Priority: Major
>
> The OverseerCollectionMessageHandler sliceCmd assumes only one replica exists 
> on one node
> When multiple replicas of a slice are on the same node we only track one 
> replica's async request. This happens because the async requestMap's key is 
> "node_name"
> I discovered this when [~alabax] shared some logs of a restore issue, where 
> the second replica got added before the first replica had completed it's 
> restorecore action.
> While looking at the logs I noticed that the overseer never called 
> REQUESTSTATUS for the restorecore action , almost as if it had missed 
> tracking that particular async request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to