[ https://issues.apache.org/jira/browse/UNOMI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17923616#comment-17923616 ]
Serge Huber commented on UNOMI-874:
-----------------------------------

Thanks for pointing that out [~jayblanc]. It is indeed a tricky problem that would need fixing. The more I think about it, the less I think we should keep the clustering inside of Unomi. I propose that for Unomi V3 we remove the clustering-specific code, since it is not used even in cluster deployments. Removing Karaf Cellar and Hazelcast will also make it much easier to upgrade to newer versions of Karaf. I already have a prototype of V3 without the clustering and it seems to work fine. This doesn't mean that you couldn't still use Unomi in a cluster configuration; it just means that the node-to-node sync would no longer be done, which is usually something you don't want in a real production environment anyway. Rolling deployments should be used instead.

> Cluster node config is empty
> ----------------------------
>
>                 Key: UNOMI-874
>                 URL: https://issues.apache.org/jira/browse/UNOMI-874
>             Project: Apache Unomi
>          Issue Type: Improvement
>            Reporter: Jerome Blanchard
>            Priority: Major
>
> We faced a recurring (but flaky) problem in the clustered version of UNOMI:
> sometimes one of the ClusterNode entries contains a null configuration when
> queried through /cxs/cluster, so publicHostAddress or internalHostAddress are
> null and clients have to take that possibility into account when trying to
> reach a cluster node. Worse, that node is not reachable at all because it
> exposes no address.
> It may be linked to a Cellar configuration replication bug that causes one of
> the nodes to end up with that configuration problem:
> [https://issues.apache.org/jira/projects/KARAF/issues/KARAF-7861?filter=allopenissues&orderby=created+DESC%2C+priority+DESC%2C+updated+DESC]
> I think the replication problem occurs in ClusterServiceImpl.init():
> [https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L191|https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L155]
> If another node is going through the same init() phase at the same time, the
> Cellar bug occurs and causes one node's config to be overridden by the other's,
> leaving a node that exists in the Karaf cluster without exposing any config.
> When nodes are then listed in getClusterNodes(), the global config for the
> publicURL (a combined string of all nodes' publicURLs separated by ',') has no
> entry for that node:
> [https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L191]
> I proposed a patch for Karaf Cellar (in the Jahia fork), but it targets version
> 4.1.3 while UNOMI relies on Cellar 4.2.1:
> [https://github.com/Jahia/karaf-cellar/commit/76ecb6b1993bfa0e9124ac8437fcfdd87249d048]
> Maybe backporting the fix could be an option...
> At least, adding a healthcheck status that reflects an invalid cluster node
> configuration could help detect this case: I suggest adding a check that flags
> the cluster as not healthy when both publicHostAddress and internalHostAddress
> are null.
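To make the suggested check concrete, here is a minimal sketch of such a healthcheck predicate. It assumes ClusterNode exposes getPublicHostAddress() and getInternalHostAddress() getters and that the node list comes from ClusterService.getClusterNodes(); the helper class and method names are hypothetical, and the result would still need to be wired into whatever healthcheck mechanism the deployment uses.

{code:java}
import java.util.List;

import org.apache.unomi.api.ClusterNode;

// Hypothetical helper illustrating the proposed healthcheck rule: the cluster is
// considered unhealthy if any node exposes neither a public nor an internal
// host address (the "empty config" situation described in this issue).
public class ClusterNodeConfigHealthCheck {

    public static boolean isClusterHealthy(List<ClusterNode> nodes) {
        if (nodes == null || nodes.isEmpty()) {
            // Assumption: an empty node list is also treated as unhealthy.
            return false;
        }
        for (ClusterNode node : nodes) {
            boolean publicMissing = isBlank(node.getPublicHostAddress());
            boolean internalMissing = isBlank(node.getInternalHostAddress());
            if (publicMissing && internalMissing) {
                // Node is present in the Karaf cluster but has no usable address.
                return false;
            }
        }
        return true;
    }

    private static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }
}
{code}

In practice this predicate could be evaluated against the list returned by getClusterNodes() and surfaced alongside the /cxs/cluster response or through a dedicated healthcheck status, depending on how the check ends up being integrated.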