[
https://issues.apache.org/jira/browse/SOLR-16506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Patson Luk resolved SOLR-16506.
-------------------------------
Resolution: Won't Fix
A mismatching node name can happen when -Dhost is used. Under such
circumstances, it is desirable to publish a down state with the updated node
name (this probably explains the check here).
Closing this proposal, as the change could prevent that correct handling.
> Flag exception during startup if replica node name does not match zk info
> -------------------------------------------------------------------------
>
> Key: SOLR-16506
> URL: https://issues.apache.org/jira/browse/SOLR-16506
> Project: Solr
> Issue Type: Improvement
> Components: SolrCloud
> Affects Versions: 9.1
> Reporter: Patson Luk
> Priority: Blocker
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> h2. Description
> We have a scenario in which two nodes (n1, n2) both have the same core name
> under the data folder (`solr_data`). Both folders have `core.properties`, but
> ONLY n1 has the data folder. In the state.json for that collection, the
> core/replica has `node_name` and `base_url` pointing at n1.
> Therefore n1 is the real node hosting the replica. We are not quite sure how
> we got into this state; it could be from some migration failure. We call the
> replica on n2 the "ghost replica".
> Now if we restart n2, it will actually take over that replica and even
> delete the data from n1:
> # `CoreContainer#load`, calls `CorePropertiesLocator` which finds all the
> cores hosted on this node by walking through the solr data directory. It
> finds the ghost core and creates a `CoreDescriptor` for it
> # `CoreContainer#createFromDescriptor` is invoked to create a `SolrCore` out
> of the `CoreDescriptor`
> # `ZkController#preRegister` is called for that `CoreDescriptor`, which
> publishes the replica state as `DOWN`. Note that usually
> `isPublishAsDownOnStartup` should return `false`, [but in this case it returns
> `true`|https://github.com/apache/solr/blob/11253f05cfb31f9fb945c831d8889b3db1e607f1/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L2035]
> as `replica.getNodeName().equals(getNodeName())` is `false`
> # During `ZkController#publish`, it will [publish the
> state.json|https://github.com/apache/solr/blob/11253f05cfb31f9fb945c831d8889b3db1e607f1/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1784]
> with incorrect `base_url` and `node_name` (n2)
> # Once the state.json is updated with the incorrect values, it
> triggers `UnloadCoreOnDeleteWatcher`, which
> [unloads/deletes|https://github.com/apache/solr/blob/b8ca0ce23e2ebe1b33c85b71fc61ab9cf8411a35/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L2865]
> the core. It will also later publish `DELETECORE` to remove the core from zk
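
Steps 3 and 4 above can be sketched in isolation. This is a minimal model
under stated assumptions, not the actual Solr code: `ReplicaEntry`, the
map standing in for state.json, `publishDown`, and the `http://n1/solr`
style URLs are all hypothetical. Only the node name comparison is taken from
the description.

```java
import java.util.HashMap;
import java.util.Map;

public class GhostReplicaSketch {

    /** Hypothetical, simplified stand-in for one replica entry in state.json. */
    static final class ReplicaEntry {
        final String nodeName;
        final String baseUrl;
        final String state;

        ReplicaEntry(String nodeName, String baseUrl, String state) {
            this.nodeName = nodeName;
            this.baseUrl = baseUrl;
            this.state = state;
        }
    }

    /** Mirrors the comparison quoted in step 3: a mismatch means "publish as DOWN". */
    static boolean isPublishAsDownOnStartup(ReplicaEntry replica, String localNodeName) {
        return !replica.nodeName.equals(localNodeName);
    }

    /** Step 4, simplified: publishing overwrites the entry with the local node's identity. */
    static void publishDown(Map<String, ReplicaEntry> stateJson, String coreNodeName,
                            String localNodeName, String localBaseUrl) {
        stateJson.put(coreNodeName, new ReplicaEntry(localNodeName, localBaseUrl, "down"));
    }

    public static void main(String[] args) {
        // state.json initially says the replica lives on n1.
        Map<String, ReplicaEntry> stateJson = new HashMap<>();
        stateJson.put("core_node1", new ReplicaEntry("n1", "http://n1/solr", "active"));

        // n2 restarts and discovers the ghost core on disk.
        String localNodeName = "n2";
        ReplicaEntry entry = stateJson.get("core_node1");
        if (isPublishAsDownOnStartup(entry, localNodeName)) {
            publishDown(stateJson, "core_node1", localNodeName, "http://n2/solr");
        }

        // The entry now points at n2, the node that never held the data.
        System.out.println(stateJson.get("core_node1").nodeName); // prints "n2"
    }
}
```

The sketch stops before step 5 on purpose; the watcher's exact condition is
not spelled out in the description, so it is not modelled here.
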
> h2. Solution
> It seems rather risky to update the state.json and publish the replica as
> down if the core does exist in the state.json but with a different node name.
> Instead, in `ZkController`, method `preRegister` -> `checkStateInZk`, we
> should interrupt the core loading if the current node name is different from
> the value in zookeeper's state.json, so that it does not attempt to publish
> DOWN for the replica and update the state.json with what is possibly the
> wrong node name.
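
The proposed guard could look roughly like the following. This is only a
sketch of the idea, assuming a hypothetical helper `checkNodeNameMatches`
and a plain `IllegalStateException`; the real change would live in
`checkStateInZk` and would presumably use Solr's own exception types.

```java
public class NodeNameCheckSketch {

    /** Hypothetical guard: abort core loading instead of publishing DOWN on a mismatch. */
    static void checkNodeNameMatches(String coreName, String zkNodeName, String localNodeName) {
        if (!zkNodeName.equals(localNodeName)) {
            throw new IllegalStateException(
                "Core " + coreName + " is registered in state.json under node " + zkNodeName
                    + " but is being loaded on node " + localNodeName);
        }
    }

    /** Returns true if the core may proceed to load, false if loading was interrupted. */
    static boolean tryLoad(String coreName, String zkNodeName, String localNodeName) {
        try {
            checkNodeNameMatches(coreName, zkNodeName, localNodeName);
            return true;
        } catch (IllegalStateException e) {
            // state.json is left untouched; only this core is skipped.
            System.err.println("Skipping core: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryLoad("ghost_core", "n1", "n1")); // prints "true"
        System.out.println(tryLoad("ghost_core", "n1", "n2")); // prints "false"
    }
}
```

Note that a guard like this would also reject the -Dhost case mentioned in
the resolution above, where a node name mismatch is expected and the DOWN
publish is the desired behavior.
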
>
> h2. Remarks
> With the proposed change, Solr will no longer "auto-correct" the state.json
> on startup if there's a node name mismatch; not sure if that's a desirable
> behavior though. Some changes are made to the unit test cases so that a test
> restart does not change the port number (i.e. the node name).
> Would love to get some input here!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)