[
https://issues.apache.org/jira/browse/NIFI-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160523#comment-17160523
]
Ryan LaMothe commented on NIFI-7631:
------------------------------------
Thank you for your feedback, and I apologize if my initial post was unclear. I have
a fairly in-depth understanding of how NiFi works internally, although I am
always learning. Currently, NiFi has both a tight integration and a tight
coupling between the NiFi UI, NiFi’s processors and NiFi’s underlying data
management (content repository) and queuing implementations (flowfile
repository). The goal of this effort is to decouple the NiFi UI from both the
content management and queueing implementations by introducing two new remote
repository implementations that can be managed and optimized independently of
both NiFi and each other.
To give some background, for a number of years I worked on a non-open source
implementation of NiFi that decoupled the NiFi UI from the underlying NiFi
Content Repository, by leveraging a distributed HPC shared filesystem (i.e.
Lustre) for file management instead of multiple dedicated local RAID arrays on
each NiFi node. At a high level, the way we approached the problem was to store
all files in remote network folders, process files remotely, move completed
files to new folders, and then pass the URI of each file to the next processor.
At a lower level, this pattern is pretty well known, effectively using an
“incoming” directory for each NiFi queue and a “working” directory for
in-progress files (as needed). Once a NiFi processor completed its work, the
file being worked or referenced was moved from the current queue’s “working”
folder to the “incoming” folder of the next queue. This leverages decades-old
filesystem semantics and guarantees. We also experimented with multiple
distributed flow file management options. This effectively decoupled the NiFi
UI from NiFi’s integrated data management and provided notable benefits,
including: individual nodes were never overloaded, queues were simple and
lightweight, nodes never ran out of disk space, node failures were easy to
recover from, and scaling NiFi became simpler. It also allowed us to
containerize NiFi without having to deal with complicated local disk storage
and redistribution mechanisms.
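The “incoming”/“working” handoff described above can be sketched roughly as
follows. This is a minimal illustration, not the original implementation: the
directory layout, the function names (make_queue, claim, hand_off) and the file
name are assumptions made for the example, and a local temporary directory
stands in for the shared filesystem. It relies on os.replace performing an
atomic rename when source and destination are on the same filesystem.

```python
import os
import tempfile
from pathlib import Path


def make_queue(root: Path, name: str) -> Path:
    """Create a hypothetical queue directory with incoming/ and working/ folders."""
    queue_dir = root / name
    (queue_dir / "incoming").mkdir(parents=True)
    (queue_dir / "working").mkdir()
    return queue_dir


def claim(queue_dir: Path, name: str) -> Path:
    """Move a file from incoming/ to working/ so a processor can work on it."""
    src = queue_dir / "incoming" / name
    dst = queue_dir / "working" / name
    # Raises FileNotFoundError if the file is no longer in incoming/.
    os.replace(src, dst)
    return dst


def hand_off(queue_dir: Path, next_queue_dir: Path, name: str) -> str:
    """Move a finished file to the next queue and return its URI."""
    src = queue_dir / "working" / name
    dst = next_queue_dir / "incoming" / name
    os.replace(src, dst)
    return dst.as_uri()


# Demo: one file travels from queue_a to queue_b.
root = Path(tempfile.mkdtemp())
q1 = make_queue(root, "queue_a")
q2 = make_queue(root, "queue_b")
(q1 / "incoming" / "data.bin").write_bytes(b"payload")

work = claim(q1, "data.bin")          # processor picks the file up
uri = hand_off(q1, q2, "data.bin")    # processor finishes, passes the URI on
print(uri)
```

Only the URI travels between processors in this sketch; the file content is
never copied, only renamed within the same filesystem.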
This got me thinking about how to decouple the Content Repository and Flowfile
Repository from the NiFi UI, sans HPC filesystem. A good replacement for the
NiFi Flowfile repository would be a proper high-performance distributed
queueing solution, such as Pulsar, Kafka or ActiveMQ. An added benefit of
some of these modern queuing systems is that they also support high-performance
object storage, essentially killing two birds with one stone. The
reason for this New Feature request is to create a new Content Repository
implementation and likely a new Flowfile Repository implementation that
utilizes Apache Pulsar in place of the existing local disk-focused NiFi
repository solutions.
A quick note about Pulsar: “Messages are the basic ‘unit’ of Pulsar. Messages
are what producers publish to topics and what consumers then consume from
topics (and acknowledge when the message has been processed). Messages are the
analogue of letters in a postal service system.”
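To make the publish / consume / acknowledge cycle in that quote concrete, here
is a toy in-memory model. It is not the Pulsar client API and needs no broker:
the class ToyTopic, its method names and the payloads are all assumptions made
for illustration. It only shows the general idea that a message which was
received but never acknowledged can be delivered again.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count


@dataclass
class Message:
    message_id: int
    payload: bytes


class ToyTopic:
    """In-memory stand-in for a topic (hypothetical, NOT the Pulsar API)."""

    def __init__(self) -> None:
        self._ids = count()
        self._pending = deque()   # published, not yet delivered
        self._unacked = {}        # delivered, awaiting acknowledgement

    def publish(self, payload: bytes) -> int:
        """Producer side: append a message to the topic."""
        msg = Message(next(self._ids), payload)
        self._pending.append(msg)
        return msg.message_id

    def receive(self) -> Message:
        """Consumer side: take the next message and track it until acknowledged."""
        msg = self._pending.popleft()
        self._unacked[msg.message_id] = msg
        return msg

    def acknowledge(self, msg: Message) -> None:
        """Consumer confirms the message has been processed."""
        del self._unacked[msg.message_id]

    def redeliver_unacknowledged(self) -> None:
        """Put unacknowledged messages back at the front, in their original order."""
        msgs = [self._unacked.pop(mid) for mid in sorted(self._unacked)]
        self._pending.extendleft(reversed(msgs))


topic = ToyTopic()
topic.publish(b"flowfile-1")
topic.publish(b"flowfile-2")

first = topic.receive()
topic.acknowledge(first)           # processed successfully

second = topic.receive()           # consumer fails before acknowledging
topic.redeliver_unacknowledged()
again = topic.receive()            # the same message is delivered again
```

In the proposed repository implementations, a real Pulsar topic would take the
place of ToyTopic; how NiFi queues would map onto topics is left open here.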
> Create a nifi.content.repository.implementation for Apache Pulsar
> ------------------------------------------------------------------
>
> Key: NIFI-7631
> URL: https://issues.apache.org/jira/browse/NIFI-7631
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Core Framework
> Reporter: Ryan LaMothe
> Priority: Major
>
> I would like to begin the development of a new
> nifi.content.repository.implementation for Apache Pulsar. In our modern,
> cloud-based streaming message environments, we are using Apache Pulsar for
> all of our persistent message/data and stream management. Apache NiFi
> currently supports only local disk (non-volatile) and in-memory (volatile)
> content repository implementations. This means that Apache NiFi currently
> performs double duty as both a workflow management environment and a
> message/data management system, as there are no remote message/data
> management content repository implementations available.
> The proposed new feature development would create a new content repository
> implementation designed around a streaming message/data architecture, in
> essence replacing the concept of a "NiFi local queue" with an "Apache Pulsar
> remote queue", allowing Apache Pulsar to remotely and independently manage
> messages/data on the behalf of NiFi. This would also support NiFi as a pure
> workflow management environment, decoupling it from its data management
> responsibilities.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)