[GitHub] [nifi] exceptionfactory commented on a change in pull request #4669: NIFI-7897: Refactored NiFi Stateless to make use of existing NiFi classes

GitBox Mon, 23 Nov 2020 13:09:01 -0800


exceptionfactory commented on a change in pull request #4669:
URL: https://github.com/apache/nifi/pull/4669#discussion_r528996539




##########
File path: nifi-nar-bundles/nifi-framework-bundle/nifi-stateless-bundle/README.md
##########
@@ -0,0 +1,405 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+      http://www.apache.org/licenses/LICENSE-2.0
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+# Introduction
+
+The Apache NiFi application can be thought of as two separate but intertwined components: the flow authorship component
+and the flow engine. By bringing these two components together into a single application, NiFi allows users to
+author a dataflow and run it in real-time in the same user interface.
+
+However, these two concepts can be separated. NiFi can be used to author flows, which can then be run by not only
+NiFi but also other compatible dataflow engines. The Apache NiFi project provides several of these dataflow engines:
+Apache NiFi itself, MiNiFi Java (A sub-project of Apache NiFi), MiNiFi C++ (A sub-project of Apache NiFi), and
+Stateless NiFi.
+
+Each of these dataflow engines has its own set of strengths and weaknesses and, as a result, its own particular
+use cases that it solves best. This document describes what Stateless NiFi is, how to use it, and its strengths
+and weaknesses.
+
+
+
+# Traditional NiFi
+
+NiFi is designed to be run as a large, multi-tenant application. It strives to take full advantage of all resources
+given to it, including disks/storage and many threads. Typically, a single NiFi instance is clustered across many
+different nodes to form a large, cohesive dataflow, which may be made up of many different sub-flows. NiFi, in general,
+will assume ownership of data that is delivered to it. It stores that data reliably on disk until it has been delivered
+to all necessary destinations. Delivery of this data may be prioritized at different points in the flow so that data
+that is most important to a particular destination gets delivered to that destination first, while that same data may
+be delivered to another destination in a different order based on prioritization. NiFi does all of this while maintaining
+very fine-grained lineage and holding a buffer of data as it was seen by every component in the flow (the combination of
+the data lineage and the rolling buffer of data is referred to as Data Provenance).
+
+Each of these features is important to provide a very powerful, broad, holistic view of how data is operated on, and flows
+through, an enterprise. There are use cases, however, that would be better served by a much lighter-weight application:
+one that can interact with all of the different endpoints that NiFi can interact with and perform all of the
+transformations, routing, filtering, and processing that NiFi can perform, but that is designed to run only a small
+sub-flow rather than a large dataflow with many sources and sinks.
+
+
+# Stateless NiFi 
+
+Enter Stateless NiFi (also referred to in this document as simply "Stateless").
+
+Many of the concepts in Stateless NiFi differ from those in the typical Apache NiFi engine.
+
+Stateless provides a dataflow engine with a smaller footprint. It does not include a user interface for
+authoring or monitoring dataflows but instead runs dataflows that were authored using the NiFi application.
+While NiFi performs best when it has access to fast storage such as SSD and NVMe drives, Stateless stores
+all data in memory.
+
+This means that if Stateless NiFi is stopped, it will no longer have direct access to the data that was in-flight.
+As a result, Stateless should only be used for dataflows where the data source is both reliable and replayable, or
+in scenarios where data loss is not a critical concern.
+
+A very common use case is to have Stateless NiFi read data from Apache Kafka or JMS and then perform some routing/filtering/
+manipulation and finally deliver the data to another destination. If a dataflow like this were to be run within NiFi,
+the data would be consumed from the source, written to NiFi's internal repositories, and acknowledged, so that NiFi will have
+taken ownership of that data. It will then be responsible for delivering it to all destinations, even if the application
+is restarted.
+
+With Stateless NiFi, though, the data would be consumed and then transferred to the next processor in the flow. The data
+would not be written to any sort of internal repository, and it would not yet be acknowledged. The next processor in the
+flow would process the data, and then pass it along. Only once the data reaches the end of the entire dataflow would the
+data received from the source be acknowledged. If Stateless is restarted before the processing completes, the data has
+not yet been acknowledged, so it is simply consumed again. This allows the data to be processed in-memory without fear
+of data loss, but it also puts the onus on the source to store the data reliably and make the data replayable.
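
The acknowledge-at-the-end behavior described above can be illustrated with a small, self-contained Python sketch. This is not NiFi code: `ReplayableSource`, `run_once`, and the `steps`/`sink` callables are hypothetical names invented for this illustration, and the source is a plain in-memory list standing in for a replayable system such as Kafka or JMS.

```python
class ReplayableSource:
    """Hypothetical stand-in for a reliable, replayable source (e.g. a Kafka topic)."""

    def __init__(self, records):
        self._records = list(records)
        self._committed = 0  # index of the first record not yet acknowledged

    def poll(self):
        """Return the next unacknowledged record without acknowledging it."""
        if self._committed < len(self._records):
            return self._records[self._committed]
        return None

    def acknowledge(self):
        """Mark the current record as fully processed."""
        self._committed += 1


def run_once(source, steps, sink):
    """Push one record through every step; acknowledge only after delivery."""
    record = source.poll()
    if record is None:
        return False
    for step in steps:
        record = step(record)  # the record is only ever held in memory
    sink(record)               # may raise; nothing has been acknowledged yet
    source.acknowledge()       # reached only when the whole flow succeeded
    return True
```

If `sink` raises (or the process stops) before `acknowledge()` is called, the next `poll()` returns the same record again, which is why the source must be both reliable and replayable.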
+
+
+## Compatible Dataflows
+
+As mentioned above, Stateless NiFi requires that the source of data be both reliable and replayable. This limits
+the sources that Stateless can reasonably interact with. Additionally, there are a few other limitations to
+the dataflows that the Stateless engine is capable of running.
+
+#### Single Source, Single Destination
+
+Each dataflow that is run in Stateless should be kept to a single source and a single sink, or destination.
+Because Stateless does not store data that it is processing, and does not store metadata such as where data is
+queued up in a dataflow, sending a single FlowFile to multiple destinations can result in data duplication.
+
+Consider a flow where data is consumed from Apache Kafka and then delivered to both HDFS and S3. If data is stored
+in HDFS, and then storing to S3 fails, the entire session will be rolled back, and the data will have to be consumed
+again. As a result, the data may be consumed and delivered to HDFS a second time. If this continues to happen, the data
+will be continually fetched from Kafka and stored in HDFS. Depending on the destination and the flow configuration, this
+may not be a concern (aside from wasted resources) but in many cases, this is a significant concern.
+
+Therefore, if the dataflow is to be run with the Stateless engine, a dataflow such as this should be broken apart into two
+different dataflows. The first would deliver data from Apache Kafka to HDFS and the other would deliver data from Apache Kafka
+to S3. Each of these dataflows should then use a separate Consumer Group for Kafka, which will result in each dataflow getting
+a copy of the same data.
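
The duplication risk in the Kafka-to-HDFS-and-S3 example can be reproduced with a short Python simulation. This is only a sketch under stated assumptions: `run_flow`, `consume_with_retries`, `put_hdfs`, and `put_s3` are hypothetical names, the two destinations are in-memory lists, and the S3 stand-in is hard-coded to fail twice before succeeding.

```python
def run_flow(record, sinks):
    """Deliver one record to every sink; any failure aborts the whole session."""
    for sink in sinks:
        sink(record)


def consume_with_retries(record, sinks, max_attempts):
    """Re-consume the record after each rollback, as a replayable source allows."""
    for attempt in range(1, max_attempts + 1):
        try:
            run_flow(record, sinks)
            return attempt
        except OSError:
            continue  # never acknowledged, so the record is consumed again
    return None


hdfs = []           # stand-in for data written to HDFS
s3 = []             # stand-in for data written to S3
s3_failures = [2]   # assumption: the S3 store fails twice, then succeeds


def put_hdfs(record):
    hdfs.append(record)  # this write is not undone when the session rolls back


def put_s3(record):
    if s3_failures[0] > 0:
        s3_failures[0] -= 1
        raise OSError("simulated S3 failure")
    s3.append(record)


attempts = consume_with_retries("event-1", [put_hdfs, put_s3], max_attempts=5)
print(attempts, len(hdfs), len(s3))
```

Every failed S3 attempt leaves one more copy in the HDFS stand-in, because the external write cannot be rolled back along with the session. Splitting the flow in two removes this coupling between the destinations.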
+
+#### Merging Not Supported
+
+Because data in Stateless NiFi transits through the dataflow synchronously from start to finish, use of Processors
+that require multiple FlowFiles, such as MergeContent and MergeRecord, will not succeed. Instead, the Processor
+will continually be triggered to run with only a single FlowFile in its queue. Since that FlowFile will generally not
+be enough to fill a 'Bin' in MergeContent or MergeRecord, the FlowFile will remain in the queue. Stateless will continue
+to trigger the processor until the FlowFile is merged by itself (due to the Processor's Max Bin Age being reached).
+If no Max Bin Age is configured, it will trigger continually without making progress.
+
+#### Cycles Not Supported
+
+In traditional NiFi, it is common to loop a 'failure' connection from a given Processor back to the same Processor.
+This results in the Processor continually trying to process the FlowFile until it is successful. However, because of
+the difference in how data transits the dataflow (i.e., synchronously in Stateless and asynchronously in standard NiFi),
+this can result in the Processor recursively calling itself. This may be okay for some dataflows, which are intended
+to loop a few times. However, for a failure loop that constantly triggers itself, this will result in a
+StackOverflowError being thrown.
+
+Instead, this should be handled in Stateless by routing the failure to an Output Port and then marking that Output Port
+as a failure port (see [Failure Ports](#failure-ports) below for more information).
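
Why a synchronous failure loop exhausts the stack while a queued one does not can be shown with a Python analogy. This is not how NiFi is implemented: `process_sync` and `process_queued` are hypothetical functions, and Python raises `RecursionError` where the JVM would throw a stack overflow error.

```python
import sys
from collections import deque

# Assumption for the demo: more retries than the interpreter's call-depth limit.
RETRIES = sys.getrecursionlimit() * 2


def process_sync(flowfile, attempts_needed, attempt=1):
    """Synchronous failure loop: each retry is a nested call, so the stack grows."""
    if attempt >= attempts_needed:
        return attempt
    return process_sync(flowfile, attempts_needed, attempt + 1)


def process_queued(flowfile, attempts_needed):
    """Queued failure loop: each retry is re-enqueued, so the stack stays flat."""
    queue = deque([(flowfile, 1)])
    while queue:
        item, attempt = queue.popleft()
        if attempt >= attempts_needed:
            return attempt
        queue.append((item, attempt + 1))
    return None
```

A loop that needs only a few retries completes either way, which matches the note that short loops may be acceptable. Calling `process_sync("f", RETRIES)` fails once the call depth limit is reached, whereas `process_queued("f", RETRIES)` completes.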
+
+#### Flows Should Not Load Massive Files
+
+In traditional NiFi, FlowFile content is stored on disk, not in memory. As a result, it is capable of handling data of any
+size, as long as it fits on the disk. However, in Stateless, FlowFile contents are stored in memory, in the JVM heap. As
+a result, it is generally not advisable to attempt to load massive files, such as a 100 GB dataset, into Stateless NiFi.
+Doing so will often result in an OutOfMemoryError, or at a minimum cause significant garbage collection, which can degrade
+performance.
+
+
+
+## Feature Comparisons
+
+As mentioned above, Stateless NiFi offers a different set of features and tradeoffs from traditional NiFi.
+Here, we summarize the key differences. This comparison is not exhaustive but provides a quick look at how
+the two runtimes operate.
+
+| Feature | Traditional NiFi | Stateless NiFi |
+|---------|------------------|----------------|
+| Data Durability | Data is reliably stored on disk in the FlowFile and Content Repositories | Data is stored in-memory and must be consumed from the source again upon restart |
+| Data Ordering | Data is ordered independently in each Connection based on the selected Prioritizers | Data flows through the system in the order it was received (First-In, First-Out / FIFO) |
+| Site-to-Site | Supports full Site-to-Site capabilities, including Server and Client roles | Can push to, or pull from, a NiFi instance but cannot receive incoming Site-to-Site connections. I.e., works as a client but not a server. |
+| Form Factor | Large form factor. Designed to take advantage of many cores and disks. | Light-weight form factor. Easily embedded into another application. Single-threaded processing. |
+| Heap Considerations | Typically, many processors in use by many users. FlowFile content should not be loaded into heap because it can easily cause heap exhaustion. | Smaller dataflows use less heap. Flow operates on only one or a few FlowFiles at a time and holds FlowFile contents in memory in the Java heap. |
+| Data Provenance | Fully stored, indexed data provenance that can be browsed through the UI and exported via Reporting Tasks | Limited Data Provenance capabilities, events being stored in memory. No ability to view but can be exported using Reporting Tasks. However, since they are in-memory, they will be lost upon restart and may roll off before they can be exported. |
+| Embeddability | While technically possible to embed traditional NiFi, it is not recommended, as it launches a heavy-weight User Interface, deals with complex authentication and authorization, and several file-based external dependencies, which can be difficult to manage. | Has minimal external dependencies (directory containing extensions and a working directory to use for temporary storage) and is much simpler to manage. Embeddability is an important feature of Stateless NiFi. |
+ 
+## Running Stateless NiFi
+
+Stateless NiFi can be used as a library and embedded into other applications. However, it can also be run directly
+from the command line from a NiFi build using the `bin/nifi.sh` script.
+
+To do so requires three files:
+
+- The engine configuration properties file
+- The dataflow configuration properties file
+- The dataflow itself (which may exist as a file, or point to a flow in a NiFi registry)
+
+Stateless NiFi accepts two separate configuration files: an engine configuration file and a dataflow configuration file.
+This is done because typically the engine configuration will be the same for all flows that are run, so it can be created
+only once. The dataflow configuration will be different for each dataflow that is to be run.
+
+An example of running Stateless NiFi:
+
+```
+bin/nifi.sh stateless -c /var/lib/nifi/stateless/config/stateless.properties /var/lib/nifi/stateless/flows/jms-to-kafka.properties

Review comment:
       Thanks for the explanation, that makes sense. As implemented, it provides a helpful distinction between required arguments and options.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
