[
https://issues.apache.org/jira/browse/UNOMI-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18022614#comment-18022614
]
Serge Huber commented on UNOMI-908:
-----------------------------------
While I do understand the concern about polling the persistence layer, I do
have some reservations about (re-)introducing a cache layer. We just removed
Hazelcast and now we are proposing to introduce another cache dependency?
The main issue with these layers is that they come and go. We had JGroups, we
had Hazelcast, now Infinispan. A lot of them are not maintained over the long
term and need replacing after a while.
In my new branch, I improved the solution by centralizing the way local
in-memory caching is handled: the scheduling, the caching and more are now
managed in one place.
The new scheduler is also more resistant to failure, and we could still improve
resiliency to restarts to make sure all the polls restart correctly.
Is there a specific production issue that needs to be solved here, or is this
mainly theoretical?
The additional index for ClusterNodes isn't that much of a problem. In the
worst case it will have maybe 100 documents in it. With multi-tenancy coming,
this would allow lots of nodes to handle thousands of customers.
> Introduce Distributed Cache to avoid intensive polling
> ------------------------------------------------------
>
> Key: UNOMI-908
> URL: https://issues.apache.org/jira/browse/UNOMI-908
> Project: Apache Unomi
> Issue Type: Improvement
> Components: unomi(-core)
> Affects Versions: unomi-3.0.0
> Reporter: Jerome Blanchard
> Priority: Major
>
> h3. Context
> Currently, some Unomi entities (like rules, segments, propertyTypes...) are
> polled in Elasticsearch every second using a scheduled job to ensure that if
> another Unomi node has made modifications, they are refreshed locally.
> This approach, while functional, is not efficient, especially when updates to
> these entities are infrequent.
> More than that, without a strong scheduler engine capable of watchdog and
> failover, such a scheduler implementation can die silently, causing invisible
> integrity problems that lead to corrupted data. We have already faced similar
> production issues where nodes ended up with different rule sets.
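> For context only, here is a minimal sketch of what such a per-second refresh
> loop typically looks like (the class and method names are illustrative
> assumptions, not Unomi's actual code):
> {code:java}
> import java.util.Collections;
> import java.util.List;
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
>
> public class RulePollingJob {
>
>     private final ScheduledExecutorService scheduler =
>             Executors.newSingleThreadScheduledExecutor();
>     private volatile List<Rule> localRules = Collections.emptyList();
>
>     public void start() {
>         // Re-reads the full rule set from the persistence layer every second,
>         // even when nothing has changed on any node.
>         scheduler.scheduleAtFixedRate(() -> {
>             // If this task threw an uncaught exception, the executor would
>             // silently stop rescheduling it: the silent-death failure mode
>             // described above.
>             try {
>                 localRules = loadAllRulesFromElasticsearch();
>             } catch (Exception e) {
>                 // swallow and retry on the next tick
>             }
>         }, 0, 1, TimeUnit.SECONDS);
>     }
>
>     // Hypothetical query against the rules index.
>     private List<Rule> loadAllRulesFromElasticsearch() {
>         return Collections.emptyList();
>     }
>
>     static class Rule { }
> }
> {code}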
> A second point: with the removal of Karaf Cellar (the Karaf cluster bundle and
> config propagation feature) in Unomi 3, another cluster topology monitoring
> mechanism has been introduced. This new implementation relies on a dedicated
> entity, ClusterNode, stored in a dedicated Elasticsearch index.
> Every 10 seconds (also using a scheduled job), the ClusterNode document is
> updated by setting its heartbeat field to the current timestamp. At the same
> time, the other ClusterNode documents are checked to see whether their latest
> heartbeat is fresh enough to keep them.
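> A minimal sketch of this heartbeat pattern (field names, index operations and
> the freshness threshold below are assumptions, not the actual Unomi schema):
> {code:java}
> import java.time.Duration;
> import java.time.Instant;
> import java.util.List;
>
> public class ClusterNodeHeartbeat {
>
>     // Assumed freshness window; the real threshold is configuration-dependent.
>     static final Duration FRESHNESS_THRESHOLD = Duration.ofSeconds(30);
>
>     // Invoked every 10 seconds by a scheduled job.
>     void heartbeatTick(ClusterNode self, List<ClusterNode> allNodes) {
>         // 1. Persist our own heartbeat to the dedicated index.
>         self.heartbeat = Instant.now();
>         saveToIndex(self);
>
>         // 2. Drop peers whose last heartbeat is no longer fresh enough.
>         for (ClusterNode node : allNodes) {
>             Duration age = Duration.between(node.heartbeat, Instant.now());
>             if (age.compareTo(FRESHNESS_THRESHOLD) > 0) {
>                 removeFromIndex(node);
>             }
>         }
>     }
>
>     // Hypothetical Elasticsearch write/delete operations on the ClusterNode index.
>     void saveToIndex(ClusterNode node) { }
>     void removeFromIndex(ClusterNode node) { }
>
>     static class ClusterNode {
>         String id;
>         Instant heartbeat;
>     }
> }
> {code}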
> This topology management is very resource-intensive and does not follow the
> state of the art in terms of architecture.
> Topology information is not something that needs to be persisted (except for
> audit purposes, which do not apply here) and should be managed in memory
> using dedicated, proven algorithms.
> Generally, in enterprise application servers or frameworks (Jakarta EE, .NET,
> Spring), this kind of transversal, generic service is built in and offered by
> the server, avoiding the need for a specific implementation.
> We may think about packaging it as a dedicated feature to fully decouple this
> concern from the Unomi logic and to allow better isolated testing with
> specific scenarios, outside of Unomi.
> h3. Proposal
> The goal here is to propose a solution that will address both problems by
> relying on an external, proven and widely used solution: distributed caching
> with Infinispan.
> Because distributed caching libraries need to rely on an internal cluster
> topology manager, we could use the same tools both to manage entity caches
> without polling AND to discover and monitor the cluster topology.
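> As an illustration of the idea (a minimal sketch, not the proposed Karaf
> feature itself; the cluster and cache names are assumptions), an embedded
> Infinispan cache manager can be started in clustered mode and expose a
> replicated cache:
> {code:java}
> import org.infinispan.Cache;
> import org.infinispan.configuration.cache.CacheMode;
> import org.infinispan.configuration.cache.ConfigurationBuilder;
> import org.infinispan.configuration.global.GlobalConfigurationBuilder;
> import org.infinispan.manager.DefaultCacheManager;
>
> public class UnomiCacheBootstrap {
>
>     public static void main(String[] args) {
>         // Clustered (JGroups-backed) cache manager; nodes sharing the same
>         // cluster name discover each other automatically.
>         GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder();
>         global.transport().clusterName("unomi-cluster"); // assumed cluster name
>         DefaultCacheManager cacheManager = new DefaultCacheManager(global.build());
>
>         // Replicated cache: a put() on one node is propagated to all members,
>         // so an entity updated anywhere is visible everywhere without polling.
>         ConfigurationBuilder cfg = new ConfigurationBuilder();
>         cfg.clustering().cacheMode(CacheMode.REPL_SYNC);
>         cacheManager.defineConfiguration("rules", cfg.build()); // assumed cache name
>
>         Cache<String, Object> rules = cacheManager.getCache("rules");
>         rules.put("sample-rule-id", "sample-rule");
>     }
> }
> {code}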
> We propose to use a generic caching feature packaged for Karaf that embeds
> Infinispan. It will be packaged as a dedicated generic caching service based
> on annotated methods and directly inspired by the current Unomi entity cache.
> The underlying JGroups library used by Infinispan will also be exposed so the
> Unomi ClusterService can be refactored to stop relying on a persisted entity.
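> For the topology side, Infinispan already exposes cluster membership and view
> changes through its listener API; here is a sketch of how a refactored
> ClusterService could observe nodes joining or leaving without any persisted
> entity (the listener class itself is hypothetical):
> {code:java}
> import java.util.List;
>
> import org.infinispan.manager.EmbeddedCacheManager;
> import org.infinispan.notifications.Listener;
> import org.infinispan.notifications.cachemanagerlistener.annotation.ViewChanged;
> import org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent;
> import org.infinispan.remoting.transport.Address;
>
> @Listener
> public class ClusterTopologyListener {
>
>     // Called whenever the underlying JGroups view changes, i.e. a node joins or leaves.
>     @ViewChanged
>     public void onViewChanged(ViewChangedEvent event) {
>         System.out.println("Cluster members are now: " + event.getNewMembers());
>     }
>
>     public static void register(EmbeddedCacheManager cacheManager) {
>         cacheManager.addListener(new ClusterTopologyListener());
>         // The current membership is also available on demand.
>         List<Address> members = cacheManager.getMembers();
>         System.out.println("Members at registration time: " + members);
>     }
> }
> {code}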
> By externalizing caching into a dedicated, widely used and proven solution,
> the Unomi code will become lighter and more robust when managing
> cluster-oriented operations on entities.
> Using a distributed cache for persistent entities is a practice that has been
> widespread for decades and has long been integrated in all enterprise-level
> frameworks (EJB, Spring, ...). This is proven technology with very strong
> implementations and support, and Infinispan is one of the best references in
> that domain (used in WildFly, Hibernate, Apache Camel, ...).
> h3. Tasks
> * Package a Unomi Cache feature that will rely on an embedded Infinispan
> * Refactor ClusterServiceImpl to take advantage of the Infinispan cluster
> manager, or simply store ClusterNode in the distributed cache instead of in
> Elasticsearch.
> * Remove Elasticsearch-based persistence logic for ClusterNode.
> * Ensure heartbeat updates are managed via the distributed cache; if not, rely
> on the distributed cache's underlying cluster management to manage ClusterNode
> entities (JGroups for Infinispan).
> * Remove the entity polling feature and use the distributed caching strategy
> for the operations that load entities from storage (see the sketch after this
> task list).
> ** The current listRules() is refactored to simply load entities from
> Elasticsearch, but with distributed caching.
> ** The updateRule() operation will also propagate the update through the
> distributed cache, avoiding any polling latency.
> * Update documentation to reflect the new architecture.
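> As referenced in the polling-removal task above, here is a sketch of what the
> refactored operations could look like, assuming a replicated Infinispan cache
> (helper and class names other than listRules()/updateRule() are illustrative):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> import org.infinispan.Cache;
>
> public class CachedRulesService {
>
>     private final Cache<String, Rule> ruleCache; // replicated Infinispan cache
>
>     public CachedRulesService(Cache<String, Rule> ruleCache) {
>         this.ruleCache = ruleCache;
>     }
>
>     // No polling: the local node reads the replicated cache, which Infinispan
>     // keeps in sync whenever any member writes to it.
>     public List<Rule> listRules() {
>         return new ArrayList<>(ruleCache.values());
>     }
>
>     // Write-through: persist the rule (hypothetical call), then put it in the
>     // cache so the update propagates to all members immediately.
>     public void updateRule(Rule rule) {
>         persistToElasticsearch(rule);
>         ruleCache.put(rule.id, rule);
>     }
>
>     // Assumed persistence call; Elasticsearch remains the system of record.
>     private void persistToElasticsearch(Rule rule) { }
>
>     public static class Rule {
>         public String id;
>     }
> }
> {code}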
> h3. Definition of Done
> * ClusterNode information is available and updated without Elasticsearch.
> * No additional Elasticsearch index is created for cluster nodes.
> * Heartbeat mechanism works reliably.
> * All 'cacheable' entities rely on the dedicated cluster-aware cache feature
> based on the Infinispan Karaf feature.
> * All polling jobs are removed.
> * A test for entity update propagation across the cluster is set up.
> * All relevant documentation is updated.
> * Integration tests confirm correct cluster node management and heartbeat
> updates.
> * No regression in cluster management functionality.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)