[
https://issues.apache.org/jira/browse/NIFI-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298410#comment-17298410
]
Ryan LaMothe edited comment on NIFI-7631 at 3/9/21, 10:55 PM:
--------------------------------------------------------------
Hello [~mdekkers]! Yes, I am definitely open to collaboration. Progress on this
effort has unfortunately been a little slow to date, but fortunately I will be
focusing on these features more intently over the next four months. My GitHub
fork of NiFi is located at
[https://github.com/rlamothe/nifi|https://github.com/rlamothe/nifi].
As described above, the main feature addition is the implementation of a "Claim
Check" pattern, moving the Flow Files into a remote messaging bus. That means
each NiFi Queue between NiFi Processors becomes a remote pub/sub Pulsar Topic.
Consequently, the physical content must be accessible from every node in the
cluster, as any node could pull a Flow File off a topic.
My work on the Apache Pulsar FlowFileRepository has not yet been checked in to
GitHub, for reasons I'll explain below.
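To make the Claim Check idea concrete, here is a minimal in-memory sketch. It is
not the implementation in my fork: the class and method names are hypothetical,
a HashMap stands in for the shared content store, and an ArrayDeque stands in
for a Pulsar Topic. The point is only that the message on the bus carries a
claim identifier and attributes, while the content bytes stay in shared storage.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.UUID;

// Hypothetical illustration of the Claim Check pattern. None of these types
// exist in NiFi or Pulsar; the map and queue are in-memory stand-ins.
class ClaimCheckSketch {

    // The lightweight message that travels over the bus: attributes plus a
    // claim identifier, never the content bytes themselves.
    static final class FlowFileMessage {
        final String claimId;
        final Map<String, String> attributes;

        FlowFileMessage(String claimId, Map<String, String> attributes) {
            this.claimId = claimId;
            this.attributes = attributes;
        }
    }

    // Stand-in for storage that every node in the cluster can read.
    private final Map<String, byte[]> sharedStore = new HashMap<>();

    // Stand-in for the topic that replaces one NiFi Queue.
    private final Queue<FlowFileMessage> topic = new ArrayDeque<>();

    // Producer side: store the content once, then publish only the claim.
    FlowFileMessage publish(byte[] content, Map<String, String> attributes) {
        String claimId = UUID.randomUUID().toString();
        sharedStore.put(claimId, content);
        FlowFileMessage message = new FlowFileMessage(claimId, new HashMap<>(attributes));
        topic.add(message);
        return message;
    }

    // Consumer side (possibly a different node): take the next message off
    // the topic and redeem its claim against the shared store.
    byte[] consume() {
        FlowFileMessage message = topic.remove();
        return sharedStore.get(message.claimId);
    }
}
```

Because the consumer only needs the claim identifier, it does not matter which
node published the message, which is why the content has to be reachable from
every node.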
How content is made globally available across the cluster is the next question.
My initial approach is based on my previous experience storing individual
files on a shared filesystem. This differs from the HDFS and S3FS content
repository NARs available on GitHub, which appear to simply store the existing
NiFi content repositories as-is on a shared filesystem. My work has been
focused on creating a new SharedFileSystem Content Repository that essentially
re-implements how NiFi stores and manages content (content claims, resource
claims, etc.), replacing the single large repository file that NiFi writes
content into and reads content out of with individual files.
The individual file storage approach has its pros and cons, which are definitely
worth debating, but the work has led down a rabbit hole. NiFi has been
designed from the ground up to assume execution on a single node, which is why
the various single-node claim and repository models exist as they do.
Clustering was clearly bolted on later, as NiFi still effectively operates as a
bunch of individual nodes. Therefore, the bulk of the brain power has gone
into understanding how to re-envision and re-implement/extend the existing
claim and repository models to assume clustered execution first, individual
node execution second. Design work is still ongoing.
Random thought: Could we take a slightly compromised approach, leaving most of
the existing claim and content repository implementation as-is, simply storing
the content repositories on a shared filesystem, and figuring out a way to
allow multiple nodes within a cluster to access a single content repository?
Then we could potentially just insert a pointer, recording where a content
repository is located on the shared filesystem and where the content itself is
located within that content repository, into the flow files sent to Pulsar. This
approach seems reasonable, but is not ideal for a number of reasons, and needs
more thinking.
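As a rough sketch of what such a pointer might look like, assuming a
hypothetical ContentPointer type with four fields (repository root, file within
the repository, byte offset, and length), any node with the shared filesystem
mounted could resolve it as follows. The field names and layout are assumptions
for illustration, not NiFi's actual claim classes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only: ContentPointer is not a NiFi type.
class ContentPointerSketch {

    // What would be embedded in the flow file sent to Pulsar.
    static final class ContentPointer {
        final String repositoryRoot; // where the content repository is mounted
        final String claimFile;      // file inside the repository, relative to the root
        final long offset;           // first byte of this flow file's content
        final int length;            // number of content bytes

        ContentPointer(String repositoryRoot, String claimFile, long offset, int length) {
            this.repositoryRoot = repositoryRoot;
            this.claimFile = claimFile;
            this.offset = offset;
            this.length = length;
        }
    }

    // Reads exactly the bytes the pointer describes from the shared filesystem.
    static byte[] resolve(ContentPointer pointer) throws IOException {
        Path file = Path.of(pointer.repositoryRoot).resolve(pointer.claimFile);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(pointer.length);
            long position = pointer.offset;
            while (buffer.hasRemaining()) {
                int read = channel.read(buffer, position);
                if (read < 0) {
                    throw new IOException("Pointer extends past end of " + file);
                }
                position += read;
            }
            return buffer.array();
        }
    }
}
```

This covers only the read path. Coordinating writes and claim cleanup across
nodes sharing one repository is not addressed here, and is one of the reasons
the approach needs more thinking.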
> Create nifi.content.repository.implementation for SharedFileSystem and
> nifi.flowfile.repository.implementation for Apache Pulsar
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-7631
> URL: https://issues.apache.org/jira/browse/NIFI-7631
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Core Framework
> Reporter: Ryan LaMothe
> Priority: Major
>
> I would like to begin the development of a new
> nifi.content.repository.implementation that utilizes a shared filesystem and
> a new nifi.flowfile.repository.implementation that utilizes Apache Pulsar. In
> our modern, cloud-based streaming message environments, we are using high
> performance shared filesystems for persistent file/object storage and Apache
> Pulsar for message management. Apache NiFi currently supports only local disk
> (non-volatile) and in-memory (volatile) repository implementations. This
> means that Apache NiFi currently performs double duty as both a workflow
> management environment and a message/data management system, as there are no
> shared data management or remote message management repository
> implementations available.
> The proposed new feature development would create a new content repository
> implementation designed around a shared filesystem and a new flowfile
> repository implementation designed around a distributed message bus. In
> essence, this replaces the concept of "NiFi local queues" with "shared
> filesystem storage and remote flowfile queues". This work will support NiFi as
> a pure workflow
> management environment, decoupling it from its current data and message
> management responsibilities.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)