[ https://issues.apache.org/jira/browse/UNOMI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jerome Blanchard updated UNOMI-874:
-----------------------------------
    Description:
We faced a recurring (but flaky) problem in the clustered version of Unomi: sometimes, one ClusterNode contains a null configuration when queried through /cxs/cluster, so publicHostAddress or internalHostAddress is null, and clients have to take that case into account when trying to reach a cluster node. Worse, that node is not reachable because its address is not exposed.

It may be linked to a Cellar configuration replication bug that causes one of the nodes to have that configuration problem:
[https://issues.apache.org/jira/projects/KARAF/issues/KARAF-7861?filter=allopenissues&orderby=created+DESC%2C+priority+DESC%2C+updated+DESC]

I think the replication problem occurs in ClusterServiceImpl.init():
[https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L191|https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L155]

If another node is going through the same init() phase at the same time, the Cellar bug occurs and causes one of the configs to be overridden by the other, so a node exists in the Karaf cluster but has no config exposed.

When nodes are then listed in getClusterNodes(), the lookup in the global config for the publicURL (a combined string of all nodes' publicURLs separated by ',') finds no entry for that node:
[https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L191]

I proposed a patch for Karaf Cellar (in the Jahia fork), but it targets version 4.1.3, whereas Unomi relies on Cellar 4.2.1:
[https://github.com/Jahia/karaf-cellar/commit/76ecb6b1993bfa0e9124ac8437fcfdd87249d048]

Maybe backporting the fix could be an option...
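
To illustrate the failure mode described above, here is a minimal sketch of how a node that is missing from the combined endpoint string ends up without an address, and how such nodes could be detected. It is not the actual ClusterServiceImpl code: the class and method names are hypothetical, and the `nodeId=url` entry format is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Unomi.
final class ClusterConfigCheck {

    // Parses a combined string such as "nodeA=http://a:8181,nodeB=http://b:8181".
    // The "nodeId=url" entry format is an assumption for this sketch.
    static Map<String, String> parseEndpoints(String combined) {
        Map<String, String> endpoints = new HashMap<>();
        if (combined == null || combined.trim().isEmpty()) {
            return endpoints;
        }
        for (String entry : combined.split(",")) {
            int sep = entry.indexOf('=');
            if (sep <= 0 || sep == entry.length() - 1) {
                continue; // skip malformed entries
            }
            String nodeId = entry.substring(0, sep).trim();
            String url = entry.substring(sep + 1).trim();
            if (!nodeId.isEmpty() && !url.isEmpty()) {
                endpoints.put(nodeId, url);
            }
        }
        return endpoints;
    }

    // Returns the ids of nodes that have no public or no internal address.
    // A node whose entry was overridden during replication shows up here.
    static List<String> nodesWithoutAddress(List<String> nodeIds,
                                            Map<String, String> publicEndpoints,
                                            Map<String, String> internalEndpoints) {
        List<String> missing = new ArrayList<>();
        for (String nodeId : nodeIds) {
            if (publicEndpoints.get(nodeId) == null || internalEndpoints.get(nodeId) == null) {
                missing.add(nodeId);
            }
        }
        return missing;
    }
}
```

A non-empty result from `nodesWithoutAddress` would correspond to the situation seen through /cxs/cluster, where a node is a cluster member but has a null publicHostAddress or internalHostAddress. This sketch flags a node when either address is missing; flagging only when both are missing is an equally plausible policy.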

At the very least, adding a healthcheck status that reflects an invalid cluster node configuration could help detect this case: I suggest adding a check for a null value of publicHostAddress and internalHostAddress to flag the cluster as not healthy.

  was:
We faced a recurring (but flaky) problem in the clustered version of Unomi: sometimes, one ClusterNode contains a null configuration when queried through /cxs/cluster, so publicHostAddress or internalHostAddress is null, and clients have to take that case into account when trying to reach a cluster node. Worse, that node is not reachable because its address is not exposed.

It may be linked to a Cellar configuration replication bug that causes one of the nodes to have that configuration problem:
[https://issues.apache.org/jira/projects/KARAF/issues/KARAF-7861?filter=allopenissues&orderby=created+DESC%2C+priority+DESC%2C+updated+DESC]

I think the replication problem occurs in ClusterServiceImpl.init():
[https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L191|https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L155]

If another node is going through the same init() phase at the same time, the Cellar bug occurs and causes one of the configs to be overridden by the other, so a node exists in the Karaf cluster but has no config exposed.

When nodes are then listed in getClusterNodes(), the lookup in the global config for the publicURL (a combined string of all nodes' publicURLs separated by ',') finds no entry for that node:
[https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L191]

I proposed a patch for Karaf Cellar (in the Jahia fork), but it targets version 4.1.3, whereas Unomi relies on Cellar 4.2.1:
[https://github.com/Jahia/karaf-cellar/commit/76ecb6b1993bfa0e9124ac8437fcfdd87249d048]

Maybe backporting the fix could be an option...

> Cluster node config is empty
> ----------------------------
>
> Key: UNOMI-874
> URL: https://issues.apache.org/jira/browse/UNOMI-874
> Project: Apache Unomi
> Issue Type: Improvement
> Reporter: Jerome Blanchard
> Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.10#820010)