[ 
https://issues.apache.org/jira/browse/SOLR-16506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patson Luk updated SOLR-16506:
------------------------------
    Description: 
h2. Description

We have a scenario in which two nodes (n1, n2) have the same core name under 
the data folder (`solr_data`). Both core folders contain `core.properties`, but 
ONLY n1 has the data folder. In the state.json for that collection, the 
core/replica has `node_name` and `base_url` pointing at n1. 

Therefore n1 is the real node hosting the replica. We are not quite sure how we 
got into this state; it could be from some migration failure. We call the 
replica on n2 the "ghost replica".

Now if we restart n2, it will actually take over the replica and even delete 
the data from n1:

# `CoreContainer#load` calls `CorePropertiesLocator`, which finds all the 
cores hosted on this node by walking through the solr data directory. It finds 
the ghost core and creates a `CoreDescriptor` for it
# `CoreContainer#createFromDescriptor` is invoked to create a `SolrCore` out of 
the `CoreDescriptor`
# `ZkController#preRegister` is called for that `CoreDescriptor` and publishes 
the replica state as `DOWN`. Note that `isPublishAsDownOnStartup` should 
usually return `false`, [but in this case it returns 
`true`|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1988]
 because `replica.getNodeName().equals(getNodeName())` is `false`
# During `ZkController#publish`, it will [publish the 
state.json|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1771]
 with incorrect `base_url` and `node_name` (n2)
# Once the state.json is updated with the incorrect values, it triggers 
`UnloadCoreOnDeleteWatcher`, which 
[unloads/deletes|https://github.com/apache/solr/blob/b8ca0ce23e2ebe1b33c85b71fc61ab9cf8411a35/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L2865]
 the core. It will also later publish `DELETECORE` to remove the core from zk
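
To make step 3 concrete, here is a minimal sketch of the decision it describes. It is a simplified model, not Solr's actual code: the class `PublishAsDownSketch` and its method signature are made up for illustration. The only part taken from the description above is the comparison of the replica's node name in state.json against the local node name.

```java
import java.util.Objects;

/** Simplified model of the startup decision in step 3; not Solr's actual code. */
final class PublishAsDownSketch {
    private PublishAsDownSketch() {}

    /**
     * Mirrors the comparison quoted in the description: the replica is
     * published as DOWN when the node name recorded in state.json differs
     * from the name of the node that is starting up.
     *
     * @param replicaNodeNameInZk node_name recorded for the replica in state.json
     * @param localNodeName       name of the node that is loading the core
     */
    static boolean publishAsDownOnStartup(String replicaNodeNameInZk, String localNodeName) {
        return !Objects.equals(replicaNodeNameInZk, localNodeName);
    }

    public static void main(String[] args) {
        // Normal restart: n1 reloads its own replica, nothing is published as DOWN.
        System.out.println(publishAsDownOnStartup("n1", "n1"));
        // Ghost replica: n2 loads a core that state.json assigns to n1.
        System.out.println(publishAsDownOnStartup("n1", "n2"));
    }
}
```

In the ghost replica case the second call returns `true`, which is what lets n2 go on to publish its own `base_url` and `node_name` in step 4.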

h2. Solution
It seems rather risky to update the state.json and publish the replica as down 
if the core does exist in the state.json but with a different node name. 

Instead, in `ZkController`, method `preRegister` -> `checkStateInZk`, we should 
interrupt the core loading if the current node name differs from the value in 
ZooKeeper's state.json, so that it does not attempt to publish DOWN for the 
replica and update the state.json with a possibly wrong node name.
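
A rough sketch of what such a guard could look like is below. All names in it (`NodeNameCheckSketch`, `NodeNameMismatchException`, `checkNodeName`) are hypothetical and are not part of Solr. The real change would live inside `checkStateInZk` and use whatever exception type fits there.

```java
/** Hypothetical guard illustrating the proposed check; not Solr's actual code. */
final class NodeNameCheckSketch {
    private NodeNameCheckSketch() {}

    /** Hypothetical exception used to abort loading of a single core. */
    static final class NodeNameMismatchException extends RuntimeException {
        NodeNameMismatchException(String message) {
            super(message);
        }
    }

    /**
     * Aborts core loading when state.json assigns the replica to another node.
     *
     * @param coreName      name of the core being loaded
     * @param nodeNameInZk  node_name recorded in state.json, or null if the
     *                      replica is not listed (left to other checks here)
     * @param localNodeName name of the node that is loading the core
     */
    static void checkNodeName(String coreName, String nodeNameInZk, String localNodeName) {
        if (nodeNameInZk != null && !nodeNameInZk.equals(localNodeName)) {
            // Fail before anything is published, so state.json is left untouched.
            throw new NodeNameMismatchException(
                "Core " + coreName + " is registered on node " + nodeNameInZk
                    + " in state.json but is being loaded on node " + localNodeName);
        }
    }

    public static void main(String[] args) {
        checkNodeName("core1", "n1", "n1"); // passes silently
        try {
            checkNodeName("core1", "n1", "n2"); // ghost replica on n2
        } catch (NodeNameMismatchException e) {
            System.out.println("Refused to load: " + e.getMessage());
        }
    }
}
```

Throwing before the publish step means the ghost core on n2 simply fails to load, while the replica on n1 and its state.json entry stay as they were.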




> Flag exception during startup if replica node name does not match zk info
> -------------------------------------------------------------------------
>
>                 Key: SOLR-16506
>                 URL: https://issues.apache.org/jira/browse/SOLR-16506
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 9.1
>            Reporter: Patson Luk
>            Priority: Blocker



--
This message was sent by Atlassian Jira
(v8.20.10#820010)