[
https://issues.apache.org/jira/browse/NIFI-8843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18068621#comment-18068621
]
Prabhjyot Singh commented on NIFI-8843:
---------------------------------------
Hi all,
I took a stab at implementing HA support for NiFi Registry and wanted to share
my attempt here for feedback from the community.
The implementation adds full multi-node HA to NiFi Registry for the first time,
with two selectable coordination backends (zookeeper and database) and zero
behavioural change for existing single-node deployments.
What's covered:
- ZooKeeper Curator LeaderSelector (and a DB TTL fallback) for leader election
- Write replication via WriteReplicationFilter — followers return 307 Temporary
Redirect to the leader (preserving mTLS client identity end-to-end), the leader
fans out asynchronously to all followers
- Push-based cache coherency via ZK ZNode watchers (or CACHE_VERSION table
polling in DB mode)
- Durable event delivery via ClusterAwareEventService with at-least-once
semantics
- Bootstrap DB sync using H2's native SCRIPT/RUNSCRIPT — no external tooling
needed
- Maintenance mode endpoint and cluster health indicator via Spring Boot
Actuator
- ZooKeeper TLS support
- Flyway V9 migration adding CACHE_VERSION, CLUSTER_LEADER, and REGISTRY_EVENT
tables
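To make the follower-side redirect behaviour concrete, here is a minimal, self-contained sketch of the decision the filter makes (class and method names here are illustrative, not the ones in the PR):

```java
import java.net.URI;
import java.util.Optional;
import java.util.Set;

// Illustrative sketch of the follower-side decision in a write-replication
// filter: mutating requests arriving at a non-leader node are answered with
// a 307 Temporary Redirect to the leader. 307 (unlike 302/303) requires the
// client to replay the same method and body, and the client re-presents its
// mTLS certificate on the redirected connection, so identity is preserved.
final class WriteRedirectDecision {

    private static final Set<String> WRITE_METHODS =
            Set.of("POST", "PUT", "PATCH", "DELETE");

    /**
     * @return the redirect Location if this node should bounce the request
     *         to the leader, or empty if it should handle it locally.
     */
    static Optional<URI> redirectTarget(boolean isLeader, String method,
                                        URI leaderBase, String requestPath) {
        if (isLeader || !WRITE_METHODS.contains(method)) {
            // the leader handles writes itself; every node serves reads
            return Optional.empty();
        }
        return Optional.of(leaderBase.resolve(requestPath));
    }
}
```

This is only the routing decision; the actual filter also has to copy query strings and handle the leader-unknown case (e.g. during an election) by returning 503 rather than redirecting.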
PR with full details, flow diagrams, and design notes:
https://github.com/apache/nifi/pull/11046
I'm sure there are areas that need improvement — happy to iterate based on any
feedback. Appreciate the community's time in reviewing this.
Thanks
> Maintenance mode switch via REST API for data backup
> ----------------------------------------------------
>
> Key: NIFI-8843
> URL: https://issues.apache.org/jira/browse/NIFI-8843
> Project: Apache NiFi
> Issue Type: New Feature
> Components: NiFi Registry
> Reporter: Kevin Doran
> Priority: Minor
> Labels: HA
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, NiFi Registry does not offer High Availability (HA) out of the
> box. One has to configure an environment around one or more NiFi Registry
> instances to achieve the required level of recoverability and availability.
> This is not a requirement in many deployment scenarios, as NiFi Registry is
> not on the critical path of most system architectures. That is, it is a place to
> save and retrieve versions of flows and extensions, but if NiFi Registry is
> temporarily offline, NiFi data flows deployed to NiFi and MiNiFi instances
> continue to function just fine.
> However, a bigger concern is data availability and backup; that is, the
> guarantee that data persisted to NiFi Registry is not lost due to an instance
> failure. Eventually, it would be nice to offer a NiFi Registry HA solution
> that allows for replicated data or external persistence providers (that
> themselves can be HA).
> In the meantime, folks are looking for the best way to build their own data
> backup and recovery solutions for NiFi Registry. A lot of possible solutions
> and recommendations for backup and recovery or [cold-slave
> failover|http://www.sonatype.org/nexus/2015/07/10/high-availability-ha-and-continuous-integration-ci-with-nexus-oss/]
> require copying the data in the NiFi Registry home directory on host storage
> to another location, where it could be used to create another NiFi Registry
> with the same data on demand, e.g., in a cloud migration or disaster recovery
> scenario.
> If the NiFi Registry service is running when this copy operation is
> performed, one risks copying partially-written data/records/files that could
> be corrupted when later loaded/read from disk. One solution for this today is
> to stop the NiFi Registry, but this leaves it unavailable for users and
> scripts, which is not ideal. For example, continuous deployment scripts for
> NiFi data flows that read flows from NiFi Registry would not be able to
> access a required service.
> In the long-term, it would be nice to offer a proper HA NiFi Registry solution
> out of the box. However, in the short-term, to avoid having to shut down
> NiFi Registry to initiate a backup, it would be nice for
> admins to be able to put a NiFi Registry instance into "read only maintenance
> mode", during which the contents of the NiFi Registry home directory could be
> more safely copied to a backup location or cold spare. (I say "more safely"
> because some files in the home directory, such as the default location for
> logs, would continue to be written to, but the most important files, such as
> those used by the file-based database and persistence providers, would
> stabilize after existing write operations are flushed to disk.)
> Implementation thoughts:
> - endpoints for turning maintenance mode on/off would fit in nicely as
> custom endpoints under Actuator (NIFIREG-134), and therefore could be access
> controlled by Actuator authorization rules
> - when maintenance mode is enabled, a custom Spring filter could intercept
> any requests that modify persisted state (e.g., by resource path and HTTP
> method pattern matching) and return a "503 Service Unavailable" status code
> indicating that the resource is temporarily unavailable. A Spring filter
> that checks HTTP methods against resources is an approach already used to
> authorize access to certain resources, so there might be an opportunity for
> code reuse there (the maintenance mode filter would need to be dynamically,
> programmatically enabled/disabled, and would return a 503 instead of a 403)
> - when maintenance mode is enabled, the /actuator/health endpoint could also
> indicate this, giving clients a way to check if a server is in maintenance
> mode or not.
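For what it's worth, the 503 gate described in the issue above can be sketched as follows (class and method names are illustrative, not what the PR uses):

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of the maintenance-mode check described above: while
// maintenance mode is on, requests that would modify persisted state are
// short-circuited with 503 Service Unavailable; read requests pass through.
final class MaintenanceModeGate {

    private static final Set<String> MUTATING =
            Set.of("POST", "PUT", "PATCH", "DELETE");

    private final AtomicBoolean maintenance = new AtomicBoolean(false);

    void enable()  { maintenance.set(true); }   // e.g. toggled via an Actuator endpoint
    void disable() { maintenance.set(false); }

    /** @return the HTTP status to short-circuit with, or -1 to continue the chain */
    int check(String method) {
        if (maintenance.get() && MUTATING.contains(method)) {
            // 503 rather than 403: this is temporary unavailability,
            // not an authorization denial
            return 503;
        }
        return -1;
    }
}
```

In a real filter this check would sit in doFilter(), and the same flag would feed a custom HealthIndicator so /actuator/health reports maintenance mode.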
--
This message was sent by Atlassian Jira
(v8.20.10#820010)