[jira] [Updated] (SOLR-9092) Add safety checks to delete replica/shard/collection commands

2016-08-13 Thread Varun Thacker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Thacker updated SOLR-9092:

Attachment: SOLR-9092.patch

Updated patch which compiles against master.

All the tests and precommit passes. I'll commit this now

> Add safety checks to delete replica/shard/collection commands
> -
>
> Key: SOLR-9092
> URL: https://issues.apache.org/jira/browse/SOLR-9092
> Project: Solr
>  Issue Type: Improvement
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Minor
> Attachments: SOLR-9092.patch, SOLR-9092.patch
>
>
> We should verify the delete commands against live_nodes to make sure the API 
> can atleast be executed correctly
> If we have a two node cluster, a collection with 1 shard 2 replica. Call the 
> delete replica command against for the replica whose node is currently down.
> You get an exception:
> {code}
> 
>
>   0
>   5173
>
>
>name="192.168.1.101:7574_solr">org.apache.solr.client.solrj.SolrServerException:Server
>  refused connection at: http://192.168.1.101:7574/solr
>
> 
> {code}
> At this point the entry for the replica is gone from state.json . The client 
> application retries since an error was thrown but the delete command will 
> never succeed now and an error like this will be seen-
> {code}
> 
>
>   400
>   137
>
>org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
>  Invalid replica : core_node3 in shard/collection : shard1/gettingstarted 
> available replicas are core_node1
>
>   Invalid replica : core_node3 in shard/collection : 
> shard1/gettingstarted available replicas are core_node1
>   400
>
>
>   
>  org.apache.solr.common.SolrException
>   name="root-error-class">org.apache.solr.common.SolrException
>   
>   Invalid replica : core_node3 in shard/collection : 
> shard1/gettingstarted available replicas are core_node1
>   400
>
> 
> {code}
> For create collection/add-replica we check the "createNodeSet" and "node" 
> params respectively against live_nodes to make sure it has a chance of 
> succeeding.
> We should add a check against live_nodes for the delete commands as well.
> Another situation where I saw this can be a problem - A second solr cluster 
> cloned from the first but the script didn't correctly change the hostnames in 
> the state.json file. When a delete command was issued against the second 
> cluster Solr deleted the replica from the first cluster.
> In the above case the script was buggy obviously but if we verify against 
> live_nodes then Solr wouldn't have gone ahead and deleted replicas not 
> belonging to its cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-9092) Add safety checks to delete replica/shard/collection commands

2016-08-10 Thread Varun Thacker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Thacker updated SOLR-9092:

Attachment: SOLR-9092.patch

I started to revisit this issue. Took a quick crack at it 

The approach I've taken is - before sending the core UNLOAD command verify if 
the core is part of a live node. If the node isn't part of live_nodes then it 
would fail anyways. So this check won't break back-compat. The cluster state 
info still gets deleted. It will protect us from the second scenario mentiond 
on the ticket

bq. Another situation where I saw this can be a problem - A second solr cluster 
cloned from the first but the script didn't correctly change the hostnames in 
the state.json file. When a delete command was issued against the second 
cluster Solr deleted the replica from the first cluster. In the above case the 
script was buggy obviously but if we verify against live_nodes then Solr 
wouldn't have gone ahead and deleted replicas not belonging to its cluster.

Just wanted to get some feedback before writing tests etc.

> Add safety checks to delete replica/shard/collection commands
> -
>
> Key: SOLR-9092
> URL: https://issues.apache.org/jira/browse/SOLR-9092
> Project: Solr
>  Issue Type: Improvement
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Minor
> Attachments: SOLR-9092.patch
>
>
> We should verify the delete commands against live_nodes to make sure the API 
> can atleast be executed correctly
> If we have a two node cluster, a collection with 1 shard 2 replica. Call the 
> delete replica command against for the replica whose node is currently down.
> You get an exception:
> {code}
> 
>
>   0
>   5173
>
>
>name="192.168.1.101:7574_solr">org.apache.solr.client.solrj.SolrServerException:Server
>  refused connection at: http://192.168.1.101:7574/solr
>
> 
> {code}
> At this point the entry for the replica is gone from state.json . The client 
> application retries since an error was thrown but the delete command will 
> never succeed now and an error like this will be seen-
> {code}
> 
>
>   400
>   137
>
>org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
>  Invalid replica : core_node3 in shard/collection : shard1/gettingstarted 
> available replicas are core_node1
>
>   Invalid replica : core_node3 in shard/collection : 
> shard1/gettingstarted available replicas are core_node1
>   400
>
>
>   
>  org.apache.solr.common.SolrException
>   name="root-error-class">org.apache.solr.common.SolrException
>   
>   Invalid replica : core_node3 in shard/collection : 
> shard1/gettingstarted available replicas are core_node1
>   400
>
> 
> {code}
> For create collection/add-replica we check the "createNodeSet" and "node" 
> params respectively against live_nodes to make sure it has a chance of 
> succeeding.
> We should add a check against live_nodes for the delete commands as well.
> Another situation where I saw this can be a problem - A second solr cluster 
> cloned from the first but the script didn't correctly change the hostnames in 
> the state.json file. When a delete command was issued against the second 
> cluster Solr deleted the replica from the first cluster.
> In the above case the script was buggy obviously but if we verify against 
> live_nodes then Solr wouldn't have gone ahead and deleted replicas not 
> belonging to its cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org