[ 
https://issues.apache.org/jira/browse/SOLR-16506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patson Luk resolved SOLR-16506.
-------------------------------
    Resolution: Won't Fix

A node name mismatch can also happen when -Dhost is used. Under such 
circumstances it is desirable to publish a down state with the updated node 
name (this probably explains the existing check).

Closing this proposal, as the change could prevent that case from being 
handled correctly.

> Flag exception during startup if replica node name does not match zk info
> -------------------------------------------------------------------------
>
>                 Key: SOLR-16506
>                 URL: https://issues.apache.org/jira/browse/SOLR-16506
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>    Affects Versions: 9.1
>            Reporter: Patson Luk
>            Priority: Blocker
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> h2. Description
> We have a scenario in which 2 nodes (n1, n2) have the same core name under 
> the data folder (`solr_data`). Both core folders have `core.properties`, but 
> ONLY n1 has the data folder. In the state.json for that collection, the 
> core/replica has `node_name` and `base_url` pointing at n1.
> Therefore n1 is the real node hosting the replica. We are not sure how 
> we got into this state; it could be from a migration failure. We call the 
> replica on n2 the "ghost replica".
> Now if we restart n2, it actually takes over the replica and even 
> deletes the data from n1:
>  # `CoreContainer#load` calls `CorePropertiesLocator`, which finds all the 
> cores hosted on this node by walking through the solr data directory. It 
> finds the ghost core and creates a `CoreDescriptor` for it
>  # `CoreContainer#createFromDescriptor` is invoked to create a `SolrCore` out 
> of the `CoreDescriptor`
>  # `ZkController#preRegister` is called for this `CoreDescriptor`, which 
> would publish the replica state as `DOWN`. Note that 
> `isPublishAsDownOnStartup` should usually return false, [but in this case it returns 
> `true`|https://github.com/apache/solr/blob/11253f05cfb31f9fb945c831d8889b3db1e607f1/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L2035]
>  because `replica.getNodeName().equals(getNodeName())` is `false`
>  # During `ZkController#publish`, it will [publish the 
> state.json|https://github.com/apache/solr/blob/11253f05cfb31f9fb945c831d8889b3db1e607f1/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1784]
>  with incorrect `base_url` and `node_name` (n2)
>  # Once the state.json is updated with the incorrect values, it 
> triggers `UnloadCoreOnDeleteWatcher`, which 
> [unloads/deletes|https://github.com/apache/solr/blob/b8ca0ce23e2ebe1b33c85b71fc61ab9cf8411a35/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L2865]
>  the core. It will also later publish `DELETECORE` to remove the core from zk
> h2. Solution
> It seems rather risky to update the state.json and publish the replica as 
> down if the core does exist in the state.json but with a different node name.
> Instead, in `ZkController`, method `preRegister` -> `checkStateInZk`, we 
> should interrupt the core loading if the current node name is different from 
> the value in ZooKeeper's state.json. That way it would not attempt to publish 
> DOWN for the replica and update the state.json with a possibly wrong node 
> name.
>  
> h2. Remarks
> With the proposed change, Solr will no longer "auto-correct" the state.json 
> on startup if there's a node name mismatch; not sure if that's desirable 
> behavior though. Some changes were made to the unit tests so that a test 
> restart does not change the port number (i.e. the node name).
> Would love to get some input here!
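
The difference between the current check and the proposed one can be sketched roughly as below. This is a simplified, hypothetical model and not Solr's actual code: `NodeNameCheckSketch`, `ReplicaState`, and `checkNodeNameMatches` are made-up names for illustration, and `isPublishAsDownOnStartup` only mirrors the comparison quoted in step 3 of the description.

```java
// Hypothetical, simplified model of the startup check; not Solr's actual code.
final class NodeNameCheckSketch {

    /** Minimal stand-in for a replica entry in state.json (illustrative only). */
    static final class ReplicaState {
        final String coreName;
        final String nodeName;

        ReplicaState(String coreName, String nodeName) {
            this.coreName = coreName;
            this.nodeName = nodeName;
        }
    }

    /**
     * Models the current behavior described in the issue: a node name
     * mismatch causes the replica to be published as DOWN on startup.
     */
    static boolean isPublishAsDownOnStartup(ReplicaState zkReplica, String localNodeName) {
        return !zkReplica.nodeName.equals(localNodeName);
    }

    /**
     * Models the proposed behavior: abort core loading on a mismatch
     * instead of rewriting state.json with the local node name.
     */
    static void checkNodeNameMatches(ReplicaState zkReplica, String localNodeName) {
        if (!zkReplica.nodeName.equals(localNodeName)) {
            throw new IllegalStateException(
                "Core " + zkReplica.coreName + " is registered on node " + zkReplica.nodeName
                    + " in state.json, but this node is " + localNodeName);
        }
    }

    public static void main(String[] args) {
        // state.json says the replica lives on n1.
        ReplicaState replica = new ReplicaState("core1", "n1");

        // Restarting n2 with a ghost copy of the core: current logic would publish DOWN.
        System.out.println(isPublishAsDownOnStartup(replica, "n2"));

        // Proposed logic: n1 loads normally, n2 fails fast.
        checkNodeNameMatches(replica, "n1");
        try {
            checkNodeNameMatches(replica, "n2");
        } catch (IllegalStateException e) {
            System.out.println("Refusing to load core: " + e.getMessage());
        }
    }
}
```

Note that a strict check like this cannot tell a ghost replica apart from a node whose name legitimately changed (for example via -Dhost), which is the concern raised in the resolution comment.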



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

