[ 
https://issues.apache.org/jira/browse/NIFI-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160523#comment-17160523
 ] 

Ryan LaMothe commented on NIFI-7631:
------------------------------------

Thank you for your feedback, and I apologize if my initial post was unclear. I 
have a fairly in-depth understanding of how NiFi works internally, although I 
am always learning. Currently, NiFi tightly couples its UI, its processors, its 
underlying data management (content repository), and its queuing 
implementation (flowfile repository). The goal of this effort is to decouple 
the NiFi UI from both the content management and queuing implementations by 
introducing two new remote repository implementations that can be managed and 
optimized independently of both NiFi and each other.

To give some background, for a number of years I worked on a non-open-source 
implementation of NiFi that decoupled the NiFi UI from the underlying NiFi 
Content Repository, by leveraging a distributed HPC shared filesystem (i.e. 
Lustre) for file management instead of multiple dedicated local RAID arrays on 
each NiFi node. At a high level, the way we approached the problem was to store 
all files in remote network folders, process files remotely, move completed 
files to new folders, and then pass the URI of each file to the next processor. 
At a lower level, this pattern is pretty well known, effectively using an 
“incoming” directory for each NiFi queue and a “working” directory for 
in-progress files (as needed). Once a NiFi processor completed its work, the 
file being worked or referenced was moved from the current queue’s “working” 
folder to the “incoming” folder of the next queue. This leverages decades-old 
filesystem semantics and guarantees. We also experimented with multiple 
distributed flow file management options. This effectively decoupled the NiFi 
UI from NiFi’s integrated data management and provided several notable 
benefits: individual nodes were never overloaded, queues were simple and 
lightweight, nodes never ran out of disk space, node failures were easy to 
recover from, and scaling NiFi became simpler. It also allowed us to 
containerize NiFi without having to deal with complicated local disk storage 
and redistribution mechanisms.
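
As a rough illustration of the “incoming”/“working” folder pattern described 
above, here is a minimal Python sketch. It is not the code we ran; the function 
names (make_queue, claim, hand_off) and the directory layout are assumptions 
made up for this example, and it relies on rename being atomic within a single 
POSIX filesystem.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def make_queue(root: Path, name: str) -> Path:
    """Create a queue directory with 'incoming' and 'working' subfolders."""
    queue = root / name
    (queue / "incoming").mkdir(parents=True, exist_ok=True)
    (queue / "working").mkdir(parents=True, exist_ok=True)
    return queue


def claim(queue: Path) -> Optional[Path]:
    """Move one file from 'incoming' to 'working'; return its new path."""
    for candidate in sorted((queue / "incoming").iterdir()):
        target = queue / "working" / candidate.name
        try:
            # A rename within one filesystem either fully happens or not,
            # so two workers cannot both claim the same file.
            os.rename(candidate, target)
        except FileNotFoundError:
            continue  # another worker claimed this file first
        return target
    return None  # nothing waiting in this queue


def hand_off(working_file: Path, next_queue: Path) -> str:
    """Move a finished file to the next queue's 'incoming'; return its URI."""
    target = next_queue / "incoming" / working_file.name
    os.rename(working_file, target)
    return target.resolve().as_uri()


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    queue_a = make_queue(root, "queue-a")
    queue_b = make_queue(root, "queue-b")
    (queue_a / "incoming" / "example.txt").write_text("payload")

    in_progress = claim(queue_a)          # file is now in queue-a/working
    # ... a processor would do its work on in_progress here ...
    print(hand_off(in_progress, queue_b)) # URI passed to the next processor
```

Only the URI travels between processors in this sketch; the file content stays 
on the shared filesystem the whole time.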

This got me thinking about how to decouple the Content Repository and Flowfile 
Repository from the NiFi UI, sans HPC filesystem. A good replacement for the 
NiFi Flowfile repository would be a proper high-performance distributed 
queuing solution, such as Pulsar, Kafka, or ActiveMQ. An added benefit of 
some of these modern queuing systems is that they also support high-performance 
object storage, essentially killing two birds with one stone. The 
reason for this New Feature request is to create a new Content Repository 
implementation and likely a new Flowfile Repository implementation that 
utilizes Apache Pulsar in place of the existing local disk-focused NiFi 
repository solutions.

A quick note about Pulsar: “Messages are the basic "unit" of Pulsar. Messages 
are what producers publish to topics and what consumers then consume from 
topics (and acknowledge when the message has been processed). Messages are the 
analogue of letters in a postal service system.”
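
To make the publish/consume/acknowledge cycle in that quote concrete, here is a 
toy in-memory model in Python. It is deliberately not the Pulsar client API: 
ToyTopic and all of its methods are hypothetical names for this sketch, it 
models a single subscription only, and its redelivery step is a simplification 
of what a real broker does with unacknowledged messages.

```python
import itertools
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class Message:
    message_id: int
    payload: bytes


class ToyTopic:
    """In-memory stand-in for a topic with one subscription (not Pulsar)."""

    def __init__(self) -> None:
        self._ids = itertools.count()
        self._pending: Deque[Message] = deque()   # published, not yet delivered
        self._unacked: Dict[int, Message] = {}    # delivered, not yet acknowledged

    def publish(self, payload: bytes) -> int:
        """Producer side: append a message to the topic."""
        msg = Message(next(self._ids), payload)
        self._pending.append(msg)
        return msg.message_id

    def receive(self) -> Optional[Message]:
        """Consumer side: take the next message, tracking it until acknowledged."""
        if not self._pending:
            return None
        msg = self._pending.popleft()
        self._unacked[msg.message_id] = msg
        return msg

    def acknowledge(self, msg: Message) -> None:
        """Consumer confirms processing; the message will not be redelivered."""
        self._unacked.pop(msg.message_id, None)

    def redeliver_unacknowledged(self) -> None:
        """Put delivered-but-unacknowledged messages back, oldest first."""
        for message_id in sorted(self._unacked, reverse=True):
            self._pending.appendleft(self._unacked.pop(message_id))


if __name__ == "__main__":
    topic = ToyTopic()
    topic.publish(b"flowfile-1")
    first = topic.receive()
    topic.redeliver_unacknowledged()   # consumer "failed" before acknowledging
    again = topic.receive()            # the same message is delivered again
    topic.acknowledge(again)           # processed; it is now gone for good
```

The point for NiFi is the middle state: a message that has been received but 
not acknowledged is still owned by the queueing system, so a node failure does 
not lose it.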

 

> Create a nifi.content.repository.implementation for Apache Pulsar 
> ------------------------------------------------------------------
>
>                 Key: NIFI-7631
>                 URL: https://issues.apache.org/jira/browse/NIFI-7631
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Ryan LaMothe
>            Priority: Major
>
> I would like to begin the development of a new 
> nifi.content.repository.implementation for Apache Pulsar. In our modern, 
> cloud-based streaming message environments, we are using Apache Pulsar for 
> all of our persistent message/data and stream management. Apache NiFi 
> currently supports only local disk (non-volatile) and in-memory (volatile) 
> content repository implementations. This means that Apache NiFi currently 
> performs double duty as both a workflow management environment and a 
> message/data management system, as there are no remote message/data 
> management content repository implementations available.
> The proposed new feature development would create a new content repository 
> implementation designed around a streaming message/data architecture, in 
> essence replacing the concept of a "NiFi local queue" with an "Apache Pulsar 
> remote queue", allowing Apache Pulsar to remotely and independently manage 
> messages/data on behalf of NiFi. This would also support NiFi as a pure 
> workflow management environment, decoupling it from its data management 
> responsibilities.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
