[ 
https://issues.apache.org/jira/browse/UNOMI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17923616#comment-17923616
 ] 

Serge Huber commented on UNOMI-874:
-----------------------------------

Thanks for pointing that out [~jayblanc]. It is indeed a tricky problem that 
needs fixing.

The more I think about it, the less I think we should keep the clustering 
inside of Unomi. I propose that for Unomi V3 we remove the clustering-specific 
code, since even in cluster deployments it is not used. Removing Karaf Cellar 
and Hazelcast will also make it much easier to upgrade to newer versions of 
Karaf. 

I already have a prototype of V3 without the clustering, and it seems to work 
fine. 

This doesn't mean that you couldn't use Unomi in a cluster configuration 
anyway, just that the node-to-node sync would no longer be done, which is 
usually something you don't want in a real production environment anyway. 
Rolling deployments should be used instead.

 

> Cluster node config is empty
> ----------------------------
>
>                 Key: UNOMI-874
>                 URL: https://issues.apache.org/jira/browse/UNOMI-874
>             Project: Apache Unomi
>          Issue Type: Improvement
>            Reporter: Jerome Blanchard
>            Priority: Major
>
> We faced a recurring (but flaky) problem in the clustered version of UNOMI:
> sometimes one of the ClusterNode entries contains a null configuration when 
> queried through /cxs/cluster, so publicHostAddress or internalHostAddress is 
> null, which forces clients to handle that case when trying to reach a 
> cluster node. Worse, that node is not reachable at all because its address 
> is not exposed.
> It may be linked to a Cellar configuration replication bug that causes one 
> of the nodes to end up with that configuration problem:
> [https://issues.apache.org/jira/projects/KARAF/issues/KARAF-7861?filter=allopenissues&orderby=created+DESC%2C+priority+DESC%2C+updated+DESC]
> I think the replication problem occurs in ClusterServiceImpl.init():
> [https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L191|https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L155]
> If any other node is performing the same init() phase at the same time, the 
> Cellar bug occurs and causes one of the configs to be overridden by the 
> other, leaving a node that exists in the Karaf cluster but has no config 
> exposed.
> When nodes are then listed in getClusterNodes(), the global config for the 
> publicURL (a combined string of all nodes' public URLs separated by ',') 
> has no entry for that node:
> [https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L191]
> I proposed a patch for Karaf Cellar (in the Jahia fork), but it targets 
> version 4.1.3 while UNOMI relies on Cellar 4.2.1:
> [https://github.com/Jahia/karaf-cellar/commit/76ecb6b1993bfa0e9124ac8437fcfdd87249d048]
> Maybe backporting the fix could be an option...
> At the very least, adding a healthcheck status that reflects an invalid 
> cluster node configuration could help detect this case: I suggest checking 
> whether both publicHostAddress and internalHostAddress are null and, if so, 
> flagging the cluster as not healthy.
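
The null-address check suggested at the end of the issue could look roughly like this. This is a minimal, self-contained sketch: the simplified ClusterNode stand-in and the isClusterHealthy helper are illustrative assumptions, not Unomi's actual healthcheck API.

```java
import java.util.List;

public class ClusterHealthCheck {

    /** Simplified stand-in for org.apache.unomi.api.ClusterNode (assumption). */
    public static class ClusterNode {
        private final String publicHostAddress;
        private final String internalHostAddress;

        public ClusterNode(String publicHostAddress, String internalHostAddress) {
            this.publicHostAddress = publicHostAddress;
            this.internalHostAddress = internalHostAddress;
        }

        public String getPublicHostAddress() { return publicHostAddress; }
        public String getInternalHostAddress() { return internalHostAddress; }
    }

    /**
     * Flags the cluster as unhealthy if any node exposes neither a public
     * nor an internal host address, i.e. its replicated config is empty.
     */
    public static boolean isClusterHealthy(List<ClusterNode> nodes) {
        for (ClusterNode node : nodes) {
            if (node.getPublicHostAddress() == null
                    && node.getInternalHostAddress() == null) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<ClusterNode> nodes = List.of(
                new ClusterNode("192.168.1.10:8181", "10.0.0.1:8181"),
                new ClusterNode(null, null)); // node whose Cellar config was lost
        System.out.println(isClusterHealthy(nodes)); // prints "false"
    }
}
```

Such a check would make the Cellar replication failure visible to monitoring instead of surfacing later as an unreachable node.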



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
