[
https://issues.apache.org/jira/browse/KARAF-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880556#comment-17880556
]
Jerome Blanchard commented on KARAF-7861:
-----------------------------------------
As this bug is impacting our product, could we start a discussion on a
possible solution, one we could participate in?
We are working on a sandbox environment that reproduces the bug.
We have investigated several possible fixes:
* Detect on the receiving node that the orchestration is corrupted (a sketch follows this list):
** Generate a hash of the updated map on the emitting node
** Include the hash in the event
** Perform an integrity check on the receiving node to detect that the event
does not match the local map version
** Retry the check after a short replication wait
** If the update never arrives (successive updates on different nodes, lost
messages, ...), throw an error or log that the local configuration is stale.
* Avoid orchestration problems altogether (see the second sketch, after the pros/cons):
** Do not use two steps in the update propagation
** Do not send a specific event for configuration updates
(ConfigurationEventHandler), but rely on the event that is available on the
ReplicatedMap itself (provided by Hazelcast) using an *EntryListener*, as
mentioned in the documentation:
[https://docs.hazelcast.com/imdg/4.2/data-structures/replicated-map#using-entrylistener-on-replicated-map]
** That event fires only when the operation is finished on the local node and
always contains the updated configuration
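A minimal sketch of the first option in Java (the class, the computeHash/verify names, the retry count and wait time are illustrative assumptions, not existing Cellar API):
{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import java.util.function.Supplier;

public class ConfigIntegrityCheck {

    // Emitting node: hash the updated dictionary so the receiver can verify
    // its local replica before applying the ClusterConfigurationEvent.
    public static String computeHash(Properties dictionary) throws Exception {
        // Sort entries so both nodes hash them in the same order.
        Map<Object, Object> sorted = new TreeMap<>(dictionary);
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        for (Map.Entry<Object, Object> entry : sorted.entrySet()) {
            digest.update(entry.getKey().toString().getBytes(StandardCharsets.UTF_8));
            digest.update(entry.getValue().toString().getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Receiving node: retry the integrity check while replication catches up,
    // then give up and signal that the local configuration is stale.
    public static void verify(Supplier<Properties> localDictionary, String expectedHash)
            throws Exception {
        for (int attempt = 0; attempt < 5; attempt++) {
            // Re-read the local replica each time so replication progress is seen.
            if (computeHash(localDictionary.get()).equals(expectedHash)) {
                return; // local map converged, safe to apply the event
            }
            Thread.sleep(200); // small replication wait before retrying
        }
        throw new IllegalStateException("Local configuration is stale: replicated map never converged");
    }
}
{code}
On a mismatch that survives the retries, the exception could trigger the fallback sync mentioned above (requesting the update again or rolling back).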
*Pros/cons:*
The first option keeps the two-step configuration propagation but introduces
conflict detection.
* CONS:
** It is not really reliable in terms of convergence, as multiple configuration
updates on many nodes may create other race conditions, and the ordering of
all messages cannot be guaranteed for lack of a central clock.
* PROS:
** It is light to implement and at least provides detection, with a
wait/retry strategy falling back to an exception. The exception could be used to
run another sync process, such as requesting the update again or rolling back...
The second option is based on Hazelcast.
* CONS:
** Harder to implement, because Hazelcast is hidden from Cellar behind the
ClusterManager; using the EntryListener implies going deeper into the
Hazelcast integration
* PROS:
** Removes the orchestration in favor of a single step, avoiding any race
condition by design
** Relies on a Hazelcast feature, which should ensure safer error management
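A minimal sketch of the second option in Java (the map name, the applyLocally helper, and direct access to the HazelcastInstance are assumptions; in real Cellar code the instance sits behind the ClusterManager, so the integration would have to reach through or extend that abstraction):
{code:java}
import com.hazelcast.core.EntryAdapter;
import com.hazelcast.core.EntryEvent;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.replicatedmap.ReplicatedMap;

import java.util.Properties;

// React to configuration changes only once the ReplicatedMap update has been
// applied on the local node, instead of relying on a separately-sent
// ClusterConfigurationEvent that can race against replication.
public class ConfigReplicationListener extends EntryAdapter<String, Properties> {

    @Override
    public void entryAdded(EntryEvent<String, Properties> event) {
        applyLocally(event.getKey(), event.getValue());
    }

    @Override
    public void entryUpdated(EntryEvent<String, Properties> event) {
        // Fires after the local replica already holds the new value, so the
        // handler always sees the updated configuration.
        applyLocally(event.getKey(), event.getValue());
    }

    private void applyLocally(String pid, Properties clusterDictionary) {
        // Push clusterDictionary into the local ConfigurationAdmin (omitted).
    }

    public static void register(HazelcastInstance hazelcast) {
        ReplicatedMap<String, Properties> map =
                hazelcast.getReplicatedMap("org.apache.karaf.cellar.config");
        map.addEntryListener(new ConfigReplicationListener());
    }
}
{code}
With such a listener in place, the separate ClusterConfigurationEvent for configuration updates could be dropped entirely, which is what removes the race.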
> Configuration replication missed due to race condition in cellar
> ----------------------------------------------------------------
>
> Key: KARAF-7861
> URL: https://issues.apache.org/jira/browse/KARAF-7861
> Project: Karaf
> Issue Type: Bug
> Components: cellar
> Environment: Karaf using Cellar in a clustered environment to
> replicate configuration updates.
> Reporter: Jerome Blanchard
> Priority: Major
>
> In a Karaf cluster using Cellar, and more specifically cellar-config, updates
> to a configuration on one node are not replicated to the other nodes.
> Investigation points to a race condition where one node receives the
> ClusterConfigurationEvent before the ReplicatedMap is effectively replicated
> on the impacted node. Thus, that node does not store the configuration and
> its local version stays stale.
> The race condition starts here:
> [https://github.com/Jahia/karaf-cellar/blob/47b6984217953a5263f7e1e0da040f488cef3a3e/config/src/main/java/org/apache/karaf/cellar/config/LocalConfigurationListener.java#L119-L127]
> and continues on another node here:
> [https://github.com/Jahia/karaf-cellar/blob/cellar-4.1.3-jahia-fixes/config/src/main/java/org/apache/karaf/cellar/config/ConfigurationEventHandler.java]
> Cellar uses a ReplicatedMap (Hazelcast) to propagate configurations
> across the cluster, and the replication operation is asynchronous. Thus, if the
> ClusterConfigurationEvent is received before the replication finishes on the
> target node, nothing happens and no error is detected, nor any retry attempted.
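>
> Simplified, the two-step propagation in the linked code boils down to the
> following sketch (Cellar types replaced by plain stand-ins):
> {code:java}
> import java.util.Map;
> import java.util.Properties;
> import java.util.concurrent.ConcurrentHashMap;
>
> public class TwoStepPropagation {
>     // Stand-in for the Hazelcast ReplicatedMap: put() returns locally,
>     // replication to other members happens asynchronously.
>     private final Map<String, Properties> clusterConfigurations = new ConcurrentHashMap<>();
>
>     public void propagate(String pid, Properties dictionary) {
>         // Step 1: update the cluster map (asynchronous replication).
>         clusterConfigurations.put(pid, dictionary);
>         // Step 2: send the event. It can reach the target node before
>         // step 1's replication lands there, so the handler compares the
>         // event against a stale local map and silently does nothing.
>         sendClusterConfigurationEvent(pid);
>     }
>
>     private void sendClusterConfigurationEvent(String pid) {
>         // Stand-in for eventProducer.produce(new ClusterConfigurationEvent(pid)).
>     }
> }
> {code}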
> To reproduce the problem we can use (thread) breakpoints:
> * The first one simulates a long replicate operation: add a breakpoint on
> the emitting node in the class
> *com.hazelcast.replicatedmap.impl.operation.ReplicateUpdateOperation.run()*
> * The second one goes in the Cellar event handler that applies the replicated
> configuration,
> *org.apache.karaf.cellar.config.ConfigurationEventHandler.handle()*, at the line:
> if (!equals(clusterDictionary, localDictionary) &&
> canDistributeConfig(localDictionary)) {
> Now update a configuration on the first node. On the target node, we can
> see that the configuration is not updated when the event is received.