Modified: nifi/site/trunk/docs/nifi-docs/html/nifi-in-depth.html URL: http://svn.apache.org/viewvc/nifi/site/trunk/docs/nifi-docs/html/nifi-in-depth.html?rev=1771892&r1=1771891&r2=1771892&view=diff ============================================================================== --- nifi/site/trunk/docs/nifi-docs/html/nifi-in-depth.html (original) +++ nifi/site/trunk/docs/nifi-docs/html/nifi-in-depth.html Tue Nov 29 12:03:34 2016 @@ -455,48 +455,48 @@ body.book #toc,body.book #preamble,body. <div id="toc" class="toc"> <div id="toctitle">Table of Contents</div> <ul class="sectlevel1"> -<li><a href="nifi-in-depth.html#intro">Intro</a></li> -<li><a href="nifi-in-depth.html#repositories">Repositories</a> +<li><a href="#intro">Intro</a></li> +<li><a href="#repositories">Repositories</a> <ul class="sectlevel2"> -<li><a href="nifi-in-depth.html#flowfile-repository">FlowFile Repository</a></li> -<li><a href="nifi-in-depth.html#content-repository">Content Repository</a></li> -<li><a href="nifi-in-depth.html#provenance-repository">Provenance Repository</a></li> -<li><a href="nifi-in-depth.html#general-repository-notes">General Repository Notes</a></li> +<li><a href="#flowfile-repository">FlowFile Repository</a></li> +<li><a href="#content-repository">Content Repository</a></li> +<li><a href="#provenance-repository">Provenance Repository</a></li> +<li><a href="#general-repository-notes">General Repository Notes</a></li> </ul> </li> -<li><a href="nifi-in-depth.html#life-of-a-flowfile">Life of a FlowFile</a> +<li><a href="#life-of-a-flowfile">Life of a FlowFile</a> <ul class="sectlevel2"> -<li><a href="nifi-in-depth.html#webcrawler-template">WebCrawler Template:</a></li> -<li><a href="nifi-in-depth.html#data-ingress">Data Ingress</a></li> -<li><a href="nifi-in-depth.html#pass-by-reference">Pass by Reference</a></li> -<li><a href="nifi-in-depth.html#extended-routing-use-cases">Extended Routing Use-cases:</a></li> -<li><a href="nifi-in-depth.html#funnels">Funnels</a></li> -<li><a 
href="nifi-in-depth.html#copy-on-write">Copy on Write</a></li> -<li><a href="nifi-in-depth.html#updating-attributes">Updating Attributes</a></li> -<li><a href="nifi-in-depth.html#data-egress">Data Egress</a></li> +<li><a href="#webcrawler-template">WebCrawler Template</a></li> +<li><a href="#data-ingress">Data Ingress</a></li> +<li><a href="#pass-by-reference">Pass by Reference</a></li> +<li><a href="#extended-routing-use-cases">Extended Routing Use-cases</a></li> +<li><a href="#funnels">Funnels</a></li> +<li><a href="#copy-on-write">Copy on Write</a></li> +<li><a href="#updating-attributes">Updating Attributes</a></li> +<li><a href="#data-egress">Data Egress</a></li> </ul> </li> -<li><a href="nifi-in-depth.html#closing-remarks">Closing Remarks</a></li> +<li><a href="#closing-remarks">Closing Remarks</a></li> </ul> </div> </div> <div id="content"> <div class="sect1"> -<h2 id="intro"><a class="anchor" href="nifi-in-depth.html#intro"></a>Intro</h2> +<h2 id="intro"><a class="anchor" href="#intro"></a>Intro</h2> <div class="sectionbody"> <div class="paragraph"> <p>This advanced level document is aimed at providing an in-depth look at the implementation and design decisions of NiFi. It assumes the reader has read enough of the other documentation to know the basics of NiFi.</p> </div> <div class="paragraph"> -<p>FlowFiles are at the heart of NiFi and its flow-based design. A FlowFile is a data record, which consists of a pointer to its content (payload) and attributes to support the content, that is associated with one or more provenance events. The attributes are key/value pairs that act as the metadata for the FlowFile, such as the FlowFile filename. The content is the actual data or the payload of the file. Provenance is a record of what’s happened to the FlowFile. Each one of these parts has its own repository (repo) for storage.</p> +<p>FlowFiles are at the heart of NiFi and its flow-based design. 
A FlowFile is a data record, which consists of a pointer to its content (payload) and attributes to support the content, that is associated with one or more provenance events. The attributes are key/value pairs that act as the metadata for the FlowFile, such as the FlowFile filename. The content is the actual data or the payload of the file. Provenance is a record of what has happened to the FlowFile. Each one of these parts has its own repository (repo) for storage.</p> </div> <div class="paragraph"> -<p>One key aspect of the repositories is immutability. The content in the Content Repository and data within the FlowFile Repository are immutable. When a change occurs to the attributes of a FlowFile new copies of the attributes are created in memory and then persisted on disk. When content is being changed for a given FlowFile its original content is read, streamed through the transform, and written to a new stream. Then the FlowFile’s content pointer is updated to the new location on disk. As a result, the default approach for FlowFile content storage can be said to be an immutable versioned content store. The benefits of which are many including substantial reduction in storage space required for the typical complex graphs of processing, natural replay capability, takes advantage of OS caching, reduces random read/write performance hits, and is easy to reason over. The previous revisions are kept according to the archiving properties set in nifi.properties file and outlined in the NiFi System Administrator’s Guide.</p> +<p>One key aspect of the repositories is immutability. The content in the Content Repository and data within the FlowFile Repository are immutable. When a change occurs to the attributes of a FlowFile, new copies of the attributes are created in memory and then persisted on disk. When content is being changed for a given FlowFile, its original content is read, streamed through the transform, and written to a new stream. 
Then the FlowFile’s content pointer is updated to the new location on disk. As a result, the default approach for FlowFile content storage can be said to be an immutable versioned content store. The benefits of this are many, including: substantial reduction in storage space required for the typical complex graphs of processing, natural replay capability, takes advantage of OS caching, reduces random read/write performance hits, and is easy to reason over. The previous revisions are kept according to the archiving properties set in <em>nifi.properties</em> file and outlined in the <a href="administration-guide.html">NiFi System Administrator’s Guide</a>.</p> </div> </div> </div> <div class="sect1"> -<h2 id="repositories"><a class="anchor" href="nifi-in-depth.html#repositories"></a>Repositories</h2> +<h2 id="repositories"><a class="anchor" href="#repositories"></a>Repositories</h2> <div class="sectionbody"> <div class="paragraph"> <p>There are three repositories that are utilized by NiFi. Each exists within the OS/Host’s file system and provides specific functionality. In order to fully understand FlowFiles and how they are used by the underlying system it’s important to know about these repositories. All three repositories are directories on local storage that NiFi uses to persist data.</p> @@ -516,40 +516,40 @@ body.book #toc,body.book #preamble,body. 
</div> <div class="imageblock"> <div class="content"> -<img src="images/NiFiArchitecture.png" alt="NiFi Architecture Diagram"> +<img src="./images/zero-master-node.png" alt="NiFi Architecture Diagram"> </div> </div> <div class="sect2"> -<h3 id="flowfile-repository"><a class="anchor" href="nifi-in-depth.html#flowfile-repository"></a>FlowFile Repository</h3> +<h3 id="flowfile-repository"><a class="anchor" href="#flowfile-repository"></a>FlowFile Repository</h3> <div class="paragraph"> -<p>FlowFiles that are actively being processed by the system is held in a hash map in the JVM memory (more about that in "Deeper View: FlowFiles in Memory and on Disk"). This makes it very efficient to process them, but requires a secondary mechanism to provide durability of data across process restarts due to any number of reasons. Reasons such as power loss, kernel panics, system upgrades, and maintenance cycles. The FlowFile Repository is a "Write-Ahead Log" (or data record) of the metadata of each of the FlowFiles that currently exist in the system. This FlowFile metadata includes all the attributes associated with the FlowFile, a pointer to the actual content of the FlowFile (which exists in the Content Repo) and the state of the FlowFile, such as which Connection/Queue the FlowFile belongs in. This Write-Ahead Log provides NiFi the resiliency it needs to handle restarts and unexpected system failures.</p> +<p>FlowFiles that are actively being processed by the system are held in a hash map in the JVM memory (more about that in <a href="#DeeperView">Deeper View: FlowFiles in Memory and on Disk</a>). This makes it very efficient to process them, but requires a secondary mechanism to provide durability of data across process restarts due to a number of reasons, such as power loss, kernel panics, system upgrades, and maintenance cycles. The FlowFile Repository is a "Write-Ahead Log" (or data record) of the metadata of each of the FlowFiles that currently exist in the system. 
This FlowFile metadata includes all the attributes associated with the FlowFile, a pointer to the actual content of the FlowFile (which exists in the Content Repo) and the state of the FlowFile, such as which Connection/Queue the FlowFile belongs in. This Write-Ahead Log provides NiFi the resiliency it needs to handle restarts and unexpected system failures.</p> </div> <div class="paragraph"> -<p>The FlowFile Repository acts as NiFi’s Write-Ahead Log, so as the FlowFiles are flowing through the system each change is logged in the FlowFile Repository before it happens as a transactional unit of work. This allows the system to know exactly what step the node is on when processing a piece of data. If the node goes down while processing the data, it can easily resume from where it left off upon restart (more in-depth in "Effect of System Failure on Transactions"). The format of the FlowFiles in the log is a series of deltas (or changes) that happened along the way. NiFi recovers a FlowFile by restoring a “snapshot” of the FlowFile (created when the Repository is check-pointed) and then replaying each of these deltas.</p> +<p>The FlowFile Repository acts as NiFi’s Write-Ahead Log, so as the FlowFiles are flowing through the system, each change is logged in the FlowFile Repository before it happens as a transactional unit of work. This allows the system to know exactly what step the node is on when processing a piece of data. If the node goes down while processing the data, it can easily resume from where it left off upon restart (more in-depth in <a href="#EffectSystemFailure">Effect of System Failure on Transactions</a>). The format of the FlowFiles in the log is a series of deltas (or changes) that happened along the way. 
NiFi recovers a FlowFile by restoring a “snapshot” of the FlowFile (created when the Repository is check-pointed) and then replaying each of these deltas.</p> </div> <div class="paragraph"> <p>A snapshot is automatically taken periodically by the system, which creates a new snapshot for each FlowFile. The system computes a new base checkpoint by serializing each FlowFile in the hash map and writing it to disk with the filename ".partial". As the checkpointing proceeds, the new FlowFile baselines are written to the ".partial" file. Once the checkpointing is done the old "snapshot" file is deleted and the ".partial" file is renamed "snapshot".</p> </div> <div class="paragraph"> -<p>The period between system checkpoints is configurable in the nifi.properties file (documented in the NiFi System Administrator’s Guide). The default is a two-minute interval.</p> +<p>The period between system checkpoints is configurable in the <em>nifi.properties</em> file (documented in the <a href="administration-guide.html">NiFi System Administrator’s Guide</a>). The default is a two-minute interval.</p> </div> <div class="sect3"> -<h4 id="effect-of-system-failure-on-transactions"><a class="anchor" href="nifi-in-depth.html#effect-of-system-failure-on-transactions"></a>Effect of System Failure on Transactions</h4> +<h4 id="EffectSystemFailure"><a class="anchor" href="#EffectSystemFailure"></a>Effect of System Failure on Transactions</h4> <div class="paragraph"> -<p>NiFi protects against hardware and system failures by keeping a record of what was happening on each node at that time in their respective FlowFile Repo. As mentioned above, the FlowFile Repo is NiFi’s Write-Ahead Log. When the node comes back online, it works to restore its state by first checking for the "snapshot" and ".partial" files. 
The node either accepts the "snapshot" and deletes the ".partial" (if it exits), or renames the ".partial" file to "snapshot" if the "snapshot" file doesn’t exist.</p> +<p>NiFi protects against hardware and system failures by keeping a record of what was happening on each node at that time in their respective FlowFile Repo. As mentioned above, the FlowFile Repo is NiFi’s Write-Ahead Log. When the node comes back online, it works to restore its state by first checking for the "snapshot" and ".partial" files. The node either accepts the "snapshot" and deletes the ".partial" (if it exists), or renames the ".partial" file to "snapshot" if the "snapshot" file doesn’t exist.</p> </div> <div class="paragraph"> <p>If the Node was in the middle of writing content when it went down, nothing is corrupted, thanks to the Copy On Write (mentioned below) and Immutability (mentioned above) paradigms. Since FlowFile transactions never modify the original content (pointed to by the content pointer), the original is safe. When NiFi goes down, the write claim for the change is orphaned and then cleaned up by the background garbage collection. This provides a “rollback” to the last known stable state.</p> </div> <div class="paragraph"> -<p>The Node then restores its state from the FlowFile. For a more in-depth, step-by-step explanation of the process, see this link: <a href="https://cwiki.apache.org/confluence/display/NIFI/NiFi%27s+Write-Ahead+Log+Implementation" class="bare">https://cwiki.apache.org/confluence/display/NIFI/NiFi%27s+Write-Ahead+Log+Implementation</a></p> +<p>The Node then restores its state from the FlowFile. 
For a more in-depth, step-by-step explanation of the process, see this link: <a href="https://cwiki.apache.org/confluence/display/NIFI/NiFi%27s+Write-Ahead+Log+Implementation" class="bare">https://cwiki.apache.org/confluence/display/NIFI/NiFi%27s+Write-Ahead+Log+Implementation</a> .</p> </div> <div class="paragraph"> <p>This setup, in terms of transactional units of work, allows NiFi to be very resilient in the face of adversity, ensuring that even if NiFi is suddenly killed, it can pick back up without any loss of data.</p> </div> </div> <div class="sect3"> -<h4 id="deeper-view-flowfiles-in-memory-and-on-disk"><a class="anchor" href="nifi-in-depth.html#deeper-view-flowfiles-in-memory-and-on-disk"></a>Deeper View: FlowFiles in Memory and on Disk</h4> +<h4 id="DeeperView"><a class="anchor" href="#DeeperView"></a>Deeper View: FlowFiles in Memory and on Disk</h4> <div class="paragraph"> <p>The term "FlowFile" is a bit of a misnomer. This would lead one to believe that each FlowFile corresponds to a file on disk, but that is not true. There are two main locations that the FlowFile attributes exist, the Write-Ahead Log that is explained above and a hash map in working memory. This hash map has a reference to all of the FlowFiles actively being used in the Flow. The object referenced by this map is the same one that is used by processors and held in connections queues. Since the FlowFile object is held in memory, all which has to be done for the Processor to get the FlowFile is to ask the ProcessSession to grab it from the queue.</p> </div> @@ -562,15 +562,15 @@ body.book #toc,body.book #preamble,body. 
</div> </div> <div class="sect2"> -<h3 id="content-repository"><a class="anchor" href="nifi-in-depth.html#content-repository"></a>Content Repository</h3> +<h3 id="content-repository"><a class="anchor" href="#content-repository"></a>Content Repository</h3> <div class="paragraph"> <p>The Content Repository is simply a place in local storage where the content of all FlowFiles exists and it is typically the largest of the three Repositories. As mentioned in the introductory section, this repository utilizes the immutability and copy-on-write paradigms to maximize speed and thread-safety. The core design decision influencing the Content Repo is to hold the FlowFile’s content on disk and only read it into JVM memory when it’s needed. This allows NiFi to handle tiny and massive sized objects without requiring producer and consumer processors to hold the full objects in memory. As a result, actions like splitting, aggregating, and transforming very large objects are quite easy to do without harming memory.</p> </div> <div class="paragraph"> -<p>In the same way the JVM Heap has a garbage collection process to reclaim unreachable objects when space is needed, there exists a dedicated thread in NiFi to analyze the Content repo for un-used content (more info in the " Deeper View: Deletion After Checkpointing" section). After a FlowFile’s content is identified as no longer in use it will either be deleted or archived. If archiving is enabled in nifi.properties then the FlowFile’s content will exist in the Content Repo either until it is aged off (deleted after a certain amount of time) or deleted due to the Content Repo taking up too much space. The conditions for archiving and/or deleting are configured in the nifi.properties file ("nifi.content.repository.archive.max.retention.period", "nifi.content.repository.archive.max.usage.percentage") and outlined in the Admin guide. 
Refer to the "Data Egress" section for more information on the deletion of content.</p> +<p>In the same way the JVM Heap has a garbage collection process to reclaim unreachable objects when space is needed, there exists a dedicated thread in NiFi to analyze the Content repo for un-used content (more info in the " Deeper View: Deletion After Checkpointing" section). After a FlowFile’s content is identified as no longer in use it will either be deleted or archived. If archiving is enabled in <em>nifi.properties</em> then the FlowFile’s content will exist in the Content Repo either until it is aged off (deleted after a certain amount of time) or deleted due to the Content Repo taking up too much space. The conditions for archiving and/or deleting are configured in the <em>nifi.properties</em> file ("nifi.content.repository.archive.max.retention.period", "nifi.content.repository.archive.max.usage.percentage") and outlined in the <a href="administration-guide.html">NiFi System Administrator’s Guide</a>. Refer to the "Data Egress" section for more information on the deletion of content.</p> </div> <div class="sect3"> -<h4 id="deeper-view-content-claim"><a class="anchor" href="nifi-in-depth.html#deeper-view-content-claim"></a>Deeper View: Content Claim</h4> +<h4 id="deeper-view-content-claim"><a class="anchor" href="#deeper-view-content-claim"></a>Deeper View: Content Claim</h4> <div class="paragraph"> <p>In general, when talking about a FlowFile, the reference to its content can simply be referred to as a "pointer" to the content. Though, the underlying implementation of the FlowFile Content reference has multiple layers of complexity. The Content Repository is made up of a collection of files on disk. These files are binned into Containers and Sections. A Section is a subdirectory of a Container. A Container can be thought of as a “root directory” for the Content Repository. The Content Repository, though, can be made up of many Containers. 
This is done so that NiFi can take advantage of multiple physical partitions in parallel.” NiFi is then capable of reading from, and writing to, all of these disks in parallel, in order to achieve data rates of hundreds of Megabytes or even Gigabytes per second of disk throughput on a single node. "Resource Claims" are Java objects that point to specific files on disk (this is done by keeping track of the file ID, the section the file is in, and the container the section is a part of).</p> </div> @@ -583,26 +583,35 @@ body.book #toc,body.book #preamble,body. </div> </div> <div class="sect2"> -<h3 id="provenance-repository"><a class="anchor" href="nifi-in-depth.html#provenance-repository"></a>Provenance Repository</h3> +<h3 id="provenance-repository"><a class="anchor" href="#provenance-repository"></a>Provenance Repository</h3> <div class="paragraph"> -<p>The Provenance Repository is where the history of each FlowFile is stored. This history is used to provide the Data Lineage (also known as the Chain of Custody) of each piece of data. Each time that an event occurs for a FlowFile (FlowFile is created, forked, cloned, modified, etc.) a new provenance event is created. This provenance event is a snapshot of the FlowFile as it looked and fit in the flow that existed at that point in time. When a provenance event is created, it copies all the FlowFile’s attributes and the pointer to the FlowFile’s content and aggregates that with the FlowFile’s state (such as its relationship with other provenance events) to one location in the Provenance Repo. This snapshot will not change, with the exception of the data being expired. The Provenance Repository holds all of these provenance events for a period of time after completion, as specified in the nifi.properties file.</p> +<p>The Provenance Repository is where the history of each FlowFile is stored. This history is used to provide the Data Lineage (also known as the Chain of Custody) of each piece of data. 
Each time that an event occurs for a FlowFile (FlowFile is created, forked, cloned, modified, etc.) a new provenance event is created. This provenance event is a snapshot of the FlowFile as it looked and fit in the flow that existed at that point in time. When a provenance event is created, it copies all the FlowFile’s attributes and the pointer to the FlowFile’s content and aggregates that with the FlowFile’s state (such as its relationship with other provenance events) to one location in the Provenance Repo. This snapshot will not change, with the exception of the data being expired. The Provenance Repository holds all of these provenance events for a period of time after completion, as specified in the <em>nifi.properties</em> file.</p> </div> <div class="paragraph"> -<p>Because all of the FlowFile attributes and the a pointer to the content are kept in the Provenance Repository, a Dataflow Manager is able to not only see the lineage, or processing history, of that piece of data, but is also able to later view the data itself and even replay the data from any point in the flow. A common use-case for this is when a particular down-stream system claims to have not received the data. The data lineage can show exactly when the data was delivered to the downstream system, what the data looked like, the filename, and the URL that the data was sent to – or can confirm that the data was indeed never sent. In either case, the Send event can be replayed with the click of a button (or by accessing the appropriate HTTP API endpoint) in order to resend the data only to that particular downstream system. 
Alternatively, if the data was not handled properly (perhaps some data manipulation should have occurred first), the flow can be fixed and then the data can be replayed into the new flow, in order to process the data properly.</p> +<p>Because all of the FlowFile attributes and the pointer to the content are kept in the Provenance Repository, a Dataflow Manager is able to not only see the lineage, or processing history, of that piece of data, but is also able to later view the data itself and even replay the data from any point in the flow. A common use-case for this is when a particular down-stream system claims to have not received the data. The data lineage can show exactly when the data was delivered to the downstream system, what the data looked like, the filename, and the URL that the data was sent to – or can confirm that the data was indeed never sent. In either case, the Send event can be replayed with the click of a button (or by accessing the appropriate HTTP API endpoint) in order to resend the data only to that particular downstream system. Alternatively, if the data was not handled properly (perhaps some data manipulation should have occurred first), the flow can be fixed and then the data can be replayed into the new flow, in order to process the data properly.</p> </div> <div class="paragraph"> <p>Keep in mind, though, that since Provenance is not copying the content in the Content Repo, and just copying the FlowFile’s pointer to the content, the content could be deleted before the provenance event that references it is deleted. This would mean that the user would no longer able to see the content or replay the FlowFile later on. However, users are still able to view the FlowFile’s lineage and understand what happened to the data. 
For instance, even though the data itself will not be accessible, the user is still able to see the unique identifier of the data, its filename (if applicable), when it was received, where it was received from, how it was manipulated, where it was sent, and so on. Additionally, since the FlowFile’s attributes are made available, a Dataflow Manager is able to understand why the data was processed in the way that it was, providing a crucial tool for understanding and debugging the dataflow.</p> </div> -<div class="paragraph"> -<p>Note: Since provenance events are snapshots of the FlowFile, as it exists in the current flow, changes to the flow may impact the ability to replay provenance events later on. For example, if a Connection is deleted from the flow, the data cannot be replayed from that point in the flow, since there is now nowhere to enqueue the data for processing.</p> +<div class="admonitionblock note"> +<table> +<tr> +<td class="icon"> +<i class="fa icon-note" title="Note"></i> +</td> +<td class="content"> +Since provenance events are snapshots of the FlowFile, as it exists in the current flow, changes to the flow may impact the ability to replay provenance events later on. For example, if a Connection is deleted from the flow, the data cannot be replayed from that point in the flow, since there is now nowhere to enqueue the data for processing. 
+</td> +</tr> +</table> </div> <div class="paragraph"> <p>For a look at the design decisions behind the Provenance Repository check out this link: <a href="https://cwiki.apache.org/confluence/display/NIFI/Persistent+Provenance+Repository+Design" class="bare">https://cwiki.apache.org/confluence/display/NIFI/Persistent+Provenance+Repository+Design</a></p> </div> <div class="sect3"> -<h4 id="deeper-view-provenance-log-files"><a class="anchor" href="nifi-in-depth.html#deeper-view-provenance-log-files"></a>Deeper View: Provenance Log Files</h4> +<h4 id="deeper-view-provenance-log-files"><a class="anchor" href="#deeper-view-provenance-log-files"></a>Deeper View: Provenance Log Files</h4> <div class="paragraph"> -<p>Each provenance event has two maps, one for the attributes before the event and one for the updated attribute values. In general, provenance events don’t store the updated values of the attributes as they existed when the event was emitted but instead, the attribute values when the session is committed. The events are cached and saved until the session is committed and once the session is committed the events are emitted with the attributes associated with the FlowFile when the session is committed. The exception to this rule is the "SEND" event, in which case the event contains the attributes as they existed when the event was emitted. This is done because if the attributes themselves were also sent, it is important to have an accurate account of exactly what information was sent.</p> +<p>Each provenance event has two maps, one for the attributes before the event and one for the updated attribute values. In general, provenance events don’t store the updated values of the attributes as they existed when the event was emitted, but instead, the attribute values when the session is committed. 
The events are cached and saved until the session is committed and once the session is committed the events are emitted with the attributes associated with the FlowFile when the session is committed. The exception to this rule is the "SEND" event, in which case the event contains the attributes as they existed when the event was emitted. This is done because if the attributes themselves were also sent, it is important to have an accurate account of exactly what information was sent.</p> </div> <div class="paragraph"> <p>As NiFi is running, there is a rolling group of 16 provenance log files. As provenance events are emitted they are written to one of the 16 files (there are multiple files to increase throughput). The log files are periodically rolled over (the default timeframe is every 30 seconds). This means the newly created provenance events start writing to a new group of 16 log files and the original ones are processed for long term storage. First the rolled over logs are merged into one file. Then the file is optionally compressed (determined by the "nifi.provenance.repository.compress.on.rollover" property). Lastly the events are indexed using Lucene and made available for querying. This batched approach for indexing means provenance events aren’t available immediately for querying but in return this dramatically increases performance because committing a transaction and indexing are very expensive tasks.</p> @@ -616,15 +625,15 @@ body.book #toc,body.book #preamble,body. 
</div> </div> <div class="sect2"> -<h3 id="general-repository-notes"><a class="anchor" href="nifi-in-depth.html#general-repository-notes"></a>General Repository Notes</h3> +<h3 id="general-repository-notes"><a class="anchor" href="#general-repository-notes"></a>General Repository Notes</h3> <div class="sect3"> -<h4 id="multiple-physical-storage-points"><a class="anchor" href="nifi-in-depth.html#multiple-physical-storage-points"></a>Multiple Physical Storage Points</h4> +<h4 id="multiple-physical-storage-points"><a class="anchor" href="#multiple-physical-storage-points"></a>Multiple Physical Storage Points</h4> <div class="paragraph"> -<p>For the Provenance and Content repos, there is the option to stripe the information across multiple physical partitions. An admin would do this if they wanted to federate reads and writes across multiple disks. The repo (Content or Provenance) is still one logical store but writes will be striped across multiple volumes/partitions automatically by the system. The directories are specified in the nifi.properties file.</p> +<p>For the Provenance and Content repos, there is the option to stripe the information across multiple physical partitions. An admin would do this if they wanted to federate reads and writes across multiple disks. The repo (Content or Provenance) is still one logical store but writes will be striped across multiple volumes/partitions automatically by the system. 
The directories are specified in the <em>nifi.properties</em> file.</p> </div> </div> <div class="sect3"> -<h4 id="best-practice"><a class="anchor" href="nifi-in-depth.html#best-practice"></a>Best Practice</h4> +<h4 id="best-practice"><a class="anchor" href="#best-practice"></a>Best Practice</h4> <div class="paragraph"> <p>It is considered a best practice to analyze the contents of a FlowFile as few times as possible and instead extract key information from the contents into the attributes of the FlowFile; then read/write information from the FlowFile attributes. One example of this is the ExtractText processor, which extracts text from the FlowFile Content and puts it as an attribute so other processors can make use of it. This provides far better performance than continually processing the entire content of the FlowFile, as the attributes are kept in-memory and updating the FlowFile repository is much faster than updating the Content repository, given the amount of data stored in each.</p> </div> @@ -633,36 +642,52 @@ body.book #toc,body.book #preamble,body. </div> </div> <div class="sect1"> -<h2 id="life-of-a-flowfile"><a class="anchor" href="nifi-in-depth.html#life-of-a-flowfile"></a>Life of a FlowFile</h2> +<h2 id="life-of-a-flowfile"><a class="anchor" href="#life-of-a-flowfile"></a>Life of a FlowFile</h2> <div class="sectionbody"> <div class="paragraph"> <p>To better understand how the repos interact with one another, the underlying functionality of NiFi, and the life of a FlowFile; this next section will include examples of a FlowFile at different points in a real flow. 
The flow is a template called "WebCrawler.xml" and is available here: <a href="https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates" class="bare">https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates</a>.</p> </div> <div class="paragraph"> -<p>At a high level, this template reaches out to a seed URL configured in the GetHTTP processor then analyzes the response using the RouteText processor to find instances of a keyword (in this case "nifi"), and potential URLs to hit. Then InvokeHTTP executes a HTTP Get request using the URLs found in the original seed web page. The response is routed based on the status code attribute and only 200-202 status codes are routed back to the original RouteText processor for analysis.</p> +<p>At a high level, this template reaches out to a seed URL configured in the GetHTTP processor, then analyzes the response using the RouteText processor to find instances of a keyword (in this case "nifi"), and potential URLs to hit. Then InvokeHTTP executes a HTTP Get request using the URLs found in the original seed web page. The response is routed based on the status code attribute and only 200-202 status codes are routed back to the original RouteText processor for analysis.</p> </div> <div class="paragraph"> <p>The flow also detects duplicate URLs and prevents processing them again, emails the user when keywords are found, logs all successful HTTP requests, and bundles up the successful requests to be compressed and archived on disk.</p> </div> -<div class="paragraph"> -<p>Note: To use this flow you need to configure a couple options. First a DistributedMapCacheServer controller service must be added with default properties. At the time of writing there was no way to explicitly add the controller service to the template and since no processors reference the service it is not included. Also to get emails, the PutEmail processor must be configured with your email credentials. 
Finally to use HTTPS the StandardSSLContextService must be configured with proper key and trust stores. Remember that the truststore must be configured with the proper Certificate Authorities in order to work for websites. The command below is an example of using the "keytool" command to add the default Java 1.8.0_60 CAs to a truststore called myTrustStore.</p> -</div> -<div class="paragraph"> -<p>keytool -importkeystore -srckeystore /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/security/cacerts -destkeystore myTrustStore</p> +<div class="admonitionblock note"> +<table> +<tr> +<td class="icon"> +<i class="fa icon-note" title="Note"></i> +</td> +<td class="content"> +To use this flow you need to configure a couple options. First a DistributedMapCacheServer controller service must be added with default properties. At the time of writing there was no way to explicitly add the controller service to the template and since no processors reference the service it is not included. Also to get emails, the PutEmail processor must be configured with your email credentials. Finally, to use HTTPS the StandardSSLContextService must be configured with proper key and trust stores. Remember that the truststore must be configured with the proper Certificate Authorities in order to work for websites. 
The command below is an example of using the "keytool" command to add the default Java 1.8.0_60 CAs to a truststore called myTrustStore: +keytool -importkeystore -srckeystore /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/security/cacerts -destkeystore myTrustStore +</td> +</tr> +</table> </div> <div class="sect2"> -<h3 id="webcrawler-template"><a class="anchor" href="nifi-in-depth.html#webcrawler-template"></a>WebCrawler Template:</h3> -<div class="paragraph"> -<p>Note that it is not uncommon for bulletins with messages such as "Connection timed out" to appear on the InvokeHttp processor due to the random nature of web crawling.</p> -</div> +<h3 id="webcrawler-template"><a class="anchor" href="#webcrawler-template"></a>WebCrawler Template</h3> <div class="imageblock"> <div class="content"> -<img src="images/WebCrawler.png" alt="Web Crawler Flow"> +<img src="./images/WebCrawler.png" alt="Web Crawler Flow"> +</div> </div> +<div class="admonitionblock note"> +<table> +<tr> +<td class="icon"> +<i class="fa icon-note" title="Note"></i> +</td> +<td class="content"> +It is not uncommon for bulletins with messages such as "Connection timed out" to appear on the InvokeHttp processor due to the random nature of web crawling. +</td> +</tr> +</table> </div> </div> <div class="sect2"> -<h3 id="data-ingress"><a class="anchor" href="nifi-in-depth.html#data-ingress"></a>Data Ingress</h3> +<h3 id="data-ingress"><a class="anchor" href="#data-ingress"></a>Data Ingress</h3> <div class="paragraph"> <p>A FlowFile is created in the system when a producer processor invokes "ProcessSession.create()" followed by an appropriate call to the ProvenanceReporter. The "ProcessSession.create()" call creates an empty FlowFile with a few core attributes (filename, path and uuid for the standard process session) but without any content or lineage to parents (the create method is overloaded to allow parameters for parent FlowFiles). 
The producer processor then adds the content and attributes to the FlowFile.</p> </div> @@ -674,32 +699,32 @@ body.book #toc,body.book #preamble,body. </div> <div class="imageblock"> <div class="content"> -<img src="images/DataIngress.png" alt="Data Ingress"> +<img src="./images/DataIngress.png" alt="Data Ingress"> </div> </div> </div> <div class="sect2"> -<h3 id="pass-by-reference"><a class="anchor" href="nifi-in-depth.html#pass-by-reference"></a>Pass by Reference</h3> +<h3 id="pass-by-reference"><a class="anchor" href="#pass-by-reference"></a>Pass by Reference</h3> <div class="paragraph"> <p>An important aspect of flow-based programming is the idea of resource-constrained relationships between the black boxes. In NiFi these are queues and processors respectively. FlowFiles are routed from one processor to another through queues simply by passing a reference to the FlowFile (similar to the "Claim Check" pattern in EIP).</p> </div> <div class="paragraph"> -<p>In the WebCrawler flow, the InvokeHTTP processor reaches out to the URL with an HTTP GET request and adds a status code attribute to the FlowFile depending on what the response was from the HTTP server. After updating the FlowFile’s filename (in the UpdateAttribute processor after InvokeHttp) there is a RouteOnAttribute processor that routes FlowFiles with successful status code attributes to two different processors. Those that are unmatched are "DROPPED" (See the Data Egress section) by the RouteOnAttribute Processor, because it is configured to Auto-Terminate any data that does not match any of the routing rules. Coming in to the RouteOnAttribute processor there is a FlowFile (F¬1) that contains the status code attribute and points to the Content (C1). There is a provenance event that points to C1 and includes a snapshot of F1 but is omitted to better focus on the routing. 
This information is located in the FlowFile, Content and Provenance Repos respectively.</p> +<p>In the WebCrawler flow, the InvokeHTTP processor reaches out to the URL with an HTTP GET request and adds a status code attribute to the FlowFile depending on what the response was from the HTTP server. After updating the FlowFile’s filename (in the UpdateAttribute processor after InvokeHttp) there is a RouteOnAttribute processor that routes FlowFiles with successful status code attributes to two different processors. Those that are unmatched are "DROPPED" (See the Data Egress section) by the RouteOnAttribute Processor, because it is configured to Auto-Terminate any data that does not match any of the routing rules. Coming in to the RouteOnAttribute processor there is a FlowFile (F1) that contains the status code attribute and points to the Content (C1). There is a provenance event that points to C1 and includes a snapshot of F1 but is omitted to better focus on the routing. This information is located in the FlowFile, Content and Provenance Repos respectively.</p> </div> <div class="paragraph"> <p>After the RouteOnAttribute processor examines the FlowFile’s status code attribute it determines that it should be routed to two different locations. The first thing that happens is the processor clones the FlowFile to create F2. This copies all of the attributes and the pointer to the content. Since it is merely routing and analyzing the attributes, the content does not change. The FlowFiles are then added to the respective connection queue to wait for the next processor to retrieve them for processing.</p> </div> <div class="paragraph"> -<p>The ProvenanceReporter documents the changes that occurred which includes a CLONE and two ROUTE events. 
Each of these events has a pointer to the relevant content and contains a copy of the respective FlowFiles in the form of a snapshot.</p> +<p>The ProvenanceReporter documents the changes that occurred, which includes a CLONE and two ROUTE events. Each of these events has a pointer to the relevant content and contains a copy of the respective FlowFiles in the form of a snapshot.</p> </div> <div class="imageblock"> <div class="content"> -<img src="images/PassByReference.png" alt="Pass By Reference"> +<img src="./images/PassByReference.png" alt="Pass By Reference"> </div> </div> </div> <div class="sect2"> -<h3 id="extended-routing-use-cases"><a class="anchor" href="nifi-in-depth.html#extended-routing-use-cases"></a>Extended Routing Use-cases:</h3> +<h3 id="extended-routing-use-cases"><a class="anchor" href="#extended-routing-use-cases"></a>Extended Routing Use-cases</h3> <div class="paragraph"> <p>In addition to routing FlowFiles based on attributes, some processors also route based on content. While it is not as efficient, sometimes it is necessary because you want to split up the content of the FlowFile into multiple FlowFiles.</p> </div> @@ -711,87 +736,114 @@ body.book #toc,body.book #preamble,body. </div> </div> <div class="sect2"> -<h3 id="funnels"><a class="anchor" href="nifi-in-depth.html#funnels"></a>Funnels</h3> +<h3 id="funnels"><a class="anchor" href="#funnels"></a>Funnels</h3> <div class="paragraph"> -<p>The funnel is a component that takes input from one or more connections and routes them to one or more destinations. The typical use-cases of which are described in the User Guide. Regardless of use-case, if there is only one processor downstream from the funnel then there are not any provenance events emitted by the funnel and it appears to be invisible in the Provenance graph. If there are multiple downstream processors, like the one in WebCrawler, then a clone event occurs. 
Referring to the graphic below, you can see that a new FlowFile (F¬2) is cloned from the original FlowFile (F1) and, just like the Routing above, the new FlowFile just has a pointer to the same content (the content is not copied).</p> +<p>The funnel is a component that takes input from one or more connections and routes them to one or more destinations. The typical use-cases of which are described in the User Guide. Regardless of use-case, if there is only one processor downstream from the funnel then there are no provenance events emitted by the funnel and it appears to be invisible in the Provenance graph. If there are multiple downstream processors, like the one in WebCrawler, then a clone event occurs. Referring to the graphic below, you can see that a new FlowFile (F2) is cloned from the original FlowFile (F1) and, just like the Routing above, the new FlowFile just has a pointer to the same content (the content is not copied).</p> </div> <div class="paragraph"> <p>From a developer point of view, you can view a Funnel just as a very simple processor. When it is scheduled to run, it simply does a "ProcessSession.get()" and then "ProcessSession.transfer()" to the output connection . If there is more than one output connection (like the example below) then a "ProcessSession.clone()" is run. Finally a "ProcessSession.commit()" is called, completing the transaction.</p> </div> <div class="imageblock"> <div class="content"> -<img src="images/Funnels.png" alt="Funnel"> +<img src="./images/Funnels.png" alt="Funnel"> </div> </div> </div> <div class="sect2"> -<h3 id="copy-on-write"><a class="anchor" href="nifi-in-depth.html#copy-on-write"></a>Copy on Write</h3> +<h3 id="copy-on-write"><a class="anchor" href="#copy-on-write"></a>Copy on Write</h3> <div class="paragraph"> <p>In the previous example, there was only routing but no changes to the content of the FlowFile. 
This next example focuses on the CompressContent processor of the template that compresses the bundle of merged FlowFiles containing webpages that were queued to be analyzed.</p> </div> <div class="paragraph"> -<p>In this example, the content C1 for FlowFile F1 is being compressed in the CompressContent processor. Since C1 is immutable and we want a full re-playable provenance history we can’t just overwrite C1. In order to "modify" C1 we do a "copy on write", which we accomplish by modifying the content as it is copied to a new location within the content repository. When doing so, FlowFile reference F1 is updated to point to the new compressed content C2 and a new Provenance Event P2 is created referencing the new FlowFile F1.1. Because the FlowFile repo is immutable, instead of modifying the old F1 a new delta (F1.1) is created. Previous provenance events still have the pointer to the Content C1 and contain old attributes but they are not the most up-to-date version of the FlowFile.</p> +<p>In this example, the content C1 for FlowFile F1 is being compressed in the CompressContent processor. Since C1 is immutable and we want a full re-playable provenance history we can’t just overwrite C1. In order to "modify" C1 we do a "copy on write", which we accomplish by modifying the content as it is copied to a new location within the content repository. When doing so, FlowFile reference F1 is updated to point to the new compressed content C2 and a new Provenance Event P2 is created referencing the new FlowFile F1.1. Because the FlowFile repo is immutable, instead of modifying the old F1, a new delta (F1.1) is created. 
Previous provenance events still have the pointer to the Content C1 and contain old attributes, but they are not the most up-to-date version of the FlowFile.</p> </div> -<div class="paragraph"> -<p>Note: For the sake of focusing on the Copy on Write event, the FlowFile’s (F1) provenance events leading up to this point are omitted.</p> +<div class="admonitionblock note"> +<table> +<tr> +<td class="icon"> +<i class="fa icon-note" title="Note"></i> +</td> +<td class="content"> +For the sake of focusing on the Copy on Write event, the FlowFile’s (F1) provenance events leading up to this point are omitted. +</td> +</tr> +</table> </div> <div class="imageblock"> <div class="content"> -<img src="images/CopyOnWrite.png" alt="Copy On Write"> +<img src="./images/CopyOnWrite.png" alt="Copy On Write"> </div> </div> <div class="sect3"> -<h4 id="extended-copy-on-write-use-case"><a class="anchor" href="nifi-in-depth.html#extended-copy-on-write-use-case"></a>Extended Copy on Write Use-case</h4> +<h4 id="extended-copy-on-write-use-case"><a class="anchor" href="#extended-copy-on-write-use-case"></a>Extended Copy on Write Use-case</h4> <div class="paragraph"> <p>A unique case of Copy on Write is the MergeContent processor. Just about every processor only acts on one FlowFile at a time. The MergeContent processor is unique in that it takes in multiple FlowFiles and combines them into one. Currently, MergeContent has multiple different Merge Strategies but all of them require the contents of the input FlowFiles to be copied to a new merged location. 
After MergeContent finishes, it emits a provenance event of type "JOIN" that establishes that the given parents were joined together to create a new child FlowFile.</p> </div> </div> </div> <div class="sect2"> -<h3 id="updating-attributes"><a class="anchor" href="nifi-in-depth.html#updating-attributes"></a>Updating Attributes</h3> +<h3 id="updating-attributes"><a class="anchor" href="#updating-attributes"></a>Updating Attributes</h3> <div class="paragraph"> -<p>Working with a FlowFile’s attributes is a core aspect of NiFi. It is assumed that attributes are small enough to be entirely read into local memory every time a processor executes on it. So it is important that they are easy to work with. As attributes are the core way of routing and processing a FlowFile it is very common to have processors that just change a FlowFile’s attributes. One such example is the UpdateAttribute processor. All the UpdateAttribute processor does is change the incoming FlowFile’s attributes according to the processor’s properties.</p> +<p>Working with a FlowFile’s attributes is a core aspect of NiFi. It is assumed that attributes are small enough to be entirely read into local memory every time a processor executes on it. So it is important that they are easy to work with. As attributes are the core way of routing and processing a FlowFile, it is very common to have processors that just change a FlowFile’s attributes. One such example is the UpdateAttribute processor. All the UpdateAttribute processor does is change the incoming FlowFile’s attributes according to the processor’s properties.</p> </div> <div class="paragraph"> <p>Taking a look at the diagram, before the processor there is the FlowFile (F1) that has attributes and a pointer to the content (C1). The processor updates the FlowFile’s attributes by creating a new delta (F1.1) that still has a pointer to the content (C1). 
An “ATTRIBUTES_MODIFIED” provenance event is emitted when this happens.</p> </div> <div class="paragraph"> -<p>In this example, the previous processor (InvokeHTTP) fetched information from a URL and created a new response FlowFile with a filename attribute that is the same as the request FlowFile. This does not help describe the response FlowFile so the UpdateAttribute processor modifies the filename attribute to something more relevant (URL and transaction ID).</p> +<p>In this example, the previous processor (InvokeHTTP) fetched information from a URL and created a new response FlowFile with a filename attribute that is the same as the request FlowFile. This does not help describe the response FlowFile, so the UpdateAttribute processor modifies the filename attribute to something more relevant (URL and transaction ID).</p> </div> -<div class="paragraph"> -<p>Note: For the sake of focusing on the ATTRIBUTES_MODIFIED event the FlowFile’s (F1) provenance events leading up to this point are omitted.</p> +<div class="admonitionblock note"> +<table> +<tr> +<td class="icon"> +<i class="fa icon-note" title="Note"></i> +</td> +<td class="content"> +For the sake of focusing on the ATTRIBUTES_MODIFIED event the FlowFile’s (F1) provenance events leading up to this point are omitted. +</td> +</tr> +</table> </div> <div class="imageblock"> <div class="content"> -<img src="images/UpdatingAttributes.png" alt="Updating Attributes"> +<img src="./images/UpdatingAttributes.png" alt="Updating Attributes"> </div> </div> <div class="sect3"> -<h4 id="typical-use-case-note"><a class="anchor" href="nifi-in-depth.html#typical-use-case-note"></a>Typical Use-case Note</h4> +<h4 id="typical-use-case-note"><a class="anchor" href="#typical-use-case-note"></a>Typical Use-case Note</h4> <div class="paragraph"> <p>In addition to adding arbitrary attributes via UpdateAttribute, extracting information from the content of a FlowFile into the attributes is a very common use-case. 
One such example in the Web Crawler flow is the ExtractText processor. We cannot use the URL when it is embedded within the content of the FlowFile, so we much extract the URL from the contents of the FlowFile and place it as an attribute. This way we can use the Expression Language to reference this attribute in the URL Property of InvokeHttp.</p> </div> </div> </div> <div class="sect2"> -<h3 id="data-egress"><a class="anchor" href="nifi-in-depth.html#data-egress"></a>Data Egress</h3> +<h3 id="data-egress"><a class="anchor" href="#data-egress"></a>Data Egress</h3> <div class="paragraph"> -<p>Eventually data in NiFi will reach a point where it has either been loaded into another system and we can stop processing it, or we filtered the FlowFile out and determined we no longer care about it. Either way, the FlowFile will eventually be "DROPPED". "DROP" is a provenance event meaning that we are no longer processing the FlowFile in the Flow and it is available for deletion. It remains in the FlowFile Repository until the next repository checkpoint. The Provenance Repository keeps the Provenance events for an amount of time stated in nifi.properties (default is 24 hours). The content in the Content Repo is marked for deletion once the FlowFile leaves NiFi and the background checkpoint processing of the Write-Ahead Log to compact/remove occurs. That is unless another FlowFile references the same content or if archiving is enabled in nifi.properties. If archiving is enabled, the content exists until either the max percentage of disk is reached or max retention period is reached (also set in nifi.properties).</p> +<p>Eventually data in NiFi will reach a point where it has either been loaded into another system and we can stop processing it, or we filtered the FlowFile out and determined we no longer care about it. Either way, the FlowFile will eventually be "DROPPED". 
"DROP" is a provenance event meaning that we are no longer processing the FlowFile in the Flow and it is available for deletion. It remains in the FlowFile Repository until the next repository checkpoint. The Provenance Repository keeps the Provenance events for an amount of time stated in <em>nifi.properties</em> (default is 24 hours). The content in the Content Repo is marked for deletion once the FlowFile leaves NiFi and the background checkpoint processing of the Write-Ahead Log to compact/remove occurs. That is unless another FlowFile references the same content or if archiving is enabled in <em>nifi.properties</em>. If archiving is enabled, the content exists until either the max percentage of disk is reached or max retention period is reached (also set in <em>nifi.properties</em>).</p> </div> <div class="sect3"> -<h4 id="deeper-view-deletion-after-checkpointing"><a class="anchor" href="nifi-in-depth.html#deeper-view-deletion-after-checkpointing"></a>Deeper View: Deletion After Checkpointing</h4> -<div class="paragraph"> -<p>Note: This section relies heavily on information from the "Deeper View: Content Claim" section above.</p> +<h4 id="deeper-view-deletion-after-checkpointing"><a class="anchor" href="#deeper-view-deletion-after-checkpointing"></a>Deeper View: Deletion After Checkpointing</h4> +<div class="admonitionblock note"> +<table> +<tr> +<td class="icon"> +<i class="fa icon-note" title="Note"></i> +</td> +<td class="content"> +This section relies heavily on information from the "Deeper View: Content Claim" section above. +</td> +</tr> +</table> </div> <div class="paragraph"> <p>Once the “.partial” file is synchronized with the underlying storage mechanism and renamed to be the new snapshot (detailed in the FlowFile Repo section) there is a callback to the FlowFile Repo to release all the old content claims (this is done after checkpointing so that content is not lost if something goes wrong). 
The FlowFile Repo knows which Content Claims can be released and notifies the Resource Claim Manager. The Resource Claim Manager keeps track of all the content claims that have been released and which resource claims are ready to be deleted (a resource claim is ready to be deleted when there are no longer any FlowFiles referencing it in the flow).</p> </div> <div class="paragraph"> -<p>Periodically the Content Repo asks the Resource Claim Manager which Resource Claims can be cleaned up. The Content Repo then makes the decision whether the Resource Claims should be archived or deleted (based on the value of the "nifi.content.repository.archive.enabled" property in the “nifi.properties” file). If archiving is disabled then the file is simply deleted from the disk. Otherwise, a background thread runs to see when archives should be deleted (based on the conditions above). This background thread keeps a list of the 10,000 oldest content claims and deletes them until below the necessary threshold. If it runs out of content claims it scans the repo for the oldest content to re-populate the list. This provides a model that is efficient in terms of both Java heap utilization as well as disk I/O utilization.</p> +<p>Periodically, the Content Repo asks the Resource Claim Manager which Resource Claims can be cleaned up. The Content Repo then makes the decision whether the Resource Claims should be archived or deleted (based on the value of the "nifi.content.repository.archive.enabled" property in the <em>nifi.properties</em> file). If archiving is disabled, then the file is simply deleted from the disk. Otherwise, a background thread runs to see when archives should be deleted (based on the conditions above). This background thread keeps a list of the 10,000 oldest content claims and deletes them until below the necessary threshold. If it runs out of content claims it scans the repo for the oldest content to re-populate the list. 
This provides a model that is efficient in terms of both Java heap utilization as well as disk I/O utilization.</p> </div> </div> <div class="sect3"> -<h4 id="associating-disparate-data"><a class="anchor" href="nifi-in-depth.html#associating-disparate-data"></a>Associating Disparate Data</h4> +<h4 id="associating-disparate-data"><a class="anchor" href="#associating-disparate-data"></a>Associating Disparate Data</h4> <div class="paragraph"> <p>One of the features of the Provenance Repository is that it allows efficient access to events that occur sequentially. A NiFi Reporting Task could then be used to iterate over these events and send them to an external service. If other systems are also sending similar types of events to this external system, it may be necessary to associate a NiFi FlowFile with another piece of information. For instance, if GetSFTP is used to retrieve data, NiFi refers to that FlowFile using its own, unique UUID. However, if the system that placed the file there referred to the file by filename, NiFi should have a mechanism to indicate that these are the same piece of data. This is accomplished by calling the ProvenanceReporter.associate() method and providing both the UUID of the FlowFile and the alternate name (the filename, in this example). Since the determination that two pieces of data are the same may be flow-dependent, it is often necessary for the DataFlow Manager to make this association. A simple way of doing this is to use the UpdateAttribute processor and configure it to set the "alternate.identifier" attribute. This automatically emits the "associate" event, using whatever value is added as the “alternate.identifier” attribute.</p> </div> @@ -800,20 +852,20 @@ body.book #toc,body.book #preamble,body. 
</div> </div> <div class="sect1"> -<h2 id="closing-remarks"><a class="anchor" href="nifi-in-depth.html#closing-remarks"></a>Closing Remarks</h2> +<h2 id="closing-remarks"><a class="anchor" href="#closing-remarks"></a>Closing Remarks</h2> <div class="sectionbody"> <div class="paragraph"> <p>Utilizing the copy-on-write, pass-by-reference, and immutability concepts in conjunction with the three repositories, NiFi is a fast, efficient, and robust enterprise dataflow platform. This document has covered specific implementations of pluggable interfaces. These include the Write-Ahead Log based implementation of the FlowFile Repository, the File based Provenance Repository, and the File based Content Repository. These implementations are the NiFi defaults but are pluggable so that, if needed, users can write their own to fulfill certain use-cases.</p> </div> <div class="paragraph"> -<p>Hopefully this document has given you a better understanding of the low-level functionality of NiFi and the decisions behind them. If there is something you wish to have explained more in depth or you feel should be included please feel free to send an email to the Apache NiFi Developer mailing list (<a href="mailto:[email protected]">[email protected]</a>).</p> +<p>Hopefully, this document has given you a better understanding of the low-level functionality of NiFi and the decisions behind them. If there is something you wish to have explained more in depth or you feel should be included please feel free to send an email to the Apache NiFi Developer mailing list (<a href="mailto:[email protected]">[email protected]</a>).</p> </div> </div> </div> </div> <div id="footer"> <div id="footer-text"> -Last updated 2016-08-29 08:05:45 -04:00 +Last updated 2016-11-26 01:07:10 -05:00 </div> </div> </body>
Modified: nifi/site/trunk/docs/nifi-docs/html/overview.html URL: http://svn.apache.org/viewvc/nifi/site/trunk/docs/nifi-docs/html/overview.html?rev=1771892&r1=1771891&r2=1771892&view=diff ============================================================================== --- nifi/site/trunk/docs/nifi-docs/html/overview.html (original) +++ nifi/site/trunk/docs/nifi-docs/html/overview.html Tue Nov 29 12:03:34 2016 @@ -455,18 +455,18 @@ body.book #toc,body.book #preamble,body. <div id="toc" class="toc"> <div id="toctitle">Table of Contents</div> <ul class="sectlevel1"> -<li><a href="overview.html#what-is-apache-nifi">What is Apache NiFi?</a></li> -<li><a href="overview.html#the-core-concepts-of-nifi">The core concepts of NiFi</a></li> -<li><a href="overview.html#nifi-architecture">NiFi Architecture</a></li> -<li><a href="overview.html#performance-expectations-and-characteristics-of-nifi">Performance Expectations and Characteristics of NiFi</a></li> -<li><a href="overview.html#high-level-overview-of-key-nifi-features">High Level Overview of Key NiFi Features</a></li> -<li><a href="overview.html#references">References</a></li> +<li><a href="#what-is-apache-nifi">What is Apache NiFi?</a></li> +<li><a href="#the-core-concepts-of-nifi">The core concepts of NiFi</a></li> +<li><a href="#nifi-architecture">NiFi Architecture</a></li> +<li><a href="#performance-expectations-and-characteristics-of-nifi">Performance Expectations and Characteristics of NiFi</a></li> +<li><a href="#high-level-overview-of-key-nifi-features">High Level Overview of Key NiFi Features</a></li> +<li><a href="#references">References</a></li> </ul> </div> </div> <div id="content"> <div class="sect1"> -<h2 id="what-is-apache-nifi"><a class="anchor" href="overview.html#what-is-apache-nifi"></a>What is Apache NiFi?</h2> +<h2 id="what-is-apache-nifi"><a class="anchor" href="#what-is-apache-nifi"></a>What is Apache NiFi?</h2> <div class="sectionbody"> <div class="paragraph"> <p>Put simply NiFi was built to automate 
the flow of data between systems. While @@ -476,7 +476,7 @@ problem space has been around ever since where some of the systems created data and some of the systems consumed data. The problems and solution patterns that emerged have been discussed and articulated extensively. A comprehensive and readily consumed form is found in -the <em>Enterprise Integration Patterns</em> <a href="overview.html#eip">[eip]</a>.</p> +the <em>Enterprise Integration Patterns</em> <a href="#eip">[eip]</a>.</p> </div> <div class="paragraph"> <p>Some of the high-level challenges of dataflow include:</p> @@ -518,8 +518,8 @@ the <em>Enterprise Integration Patterns< architecture. Now though there are a number of active and rapidly evolving movements making dataflow a lot more interesting and a lot more vital to the success of a given enterprise. These include things like; Service Oriented -Architecture <a href="overview.html#soa">[soa]</a>, the rise of the API <a href="overview.html#api">[api]</a><a href="overview.html#api2">[api2]</a>, Internet of Things <a href="overview.html#iot">[iot]</a>, -and Big Data <a href="overview.html#bigdata">[bigdata]</a>. In addition, the level of rigor necessary for +Architecture <a href="#soa">[soa]</a>, the rise of the API <a href="#api">[api]</a><a href="#api2">[api2]</a>, Internet of Things <a href="#iot">[iot]</a>, +and Big Data <a href="#bigdata">[bigdata]</a>. In addition, the level of rigor necessary for compliance, privacy, and security is constantly on the rise. Even still with all of these new concepts coming about, the patterns and needs of dataflow are still largely the same. 
The primary differences then are the scope of @@ -530,11 +530,11 @@ modern dataflow challenges.</p> </div> </div> <div class="sect1"> -<h2 id="the-core-concepts-of-nifi"><a class="anchor" href="overview.html#the-core-concepts-of-nifi"></a>The core concepts of NiFi</h2> +<h2 id="the-core-concepts-of-nifi"><a class="anchor" href="#the-core-concepts-of-nifi"></a>The core concepts of NiFi</h2> <div class="sectionbody"> <div class="paragraph"> <p>NiFi’s fundamental design concepts closely relate to the main ideas of Flow Based -Programming <a href="overview.html#fbp">[fbp]</a>. Here are some of +Programming <a href="#fbp">[fbp]</a>. Here are some of the main NiFi concepts and how they map to FBP:</p> </div> <table class="tableblock frame-all grid-rows spread"> @@ -561,7 +561,7 @@ content of zero or more bytes.</p></td> <tr> <td class="tableblock halign-left valign-top"><p class="tableblock">FlowFile Processor</p></td> <td class="tableblock halign-left valign-top"><p class="tableblock">Black Box</p></td> -<td class="tableblock halign-left valign-top"><p class="tableblock">Processors actually perform the work. In <a href="overview.html#eip">[eip]</a> terms a processor is +<td class="tableblock halign-left valign-top"><p class="tableblock">Processors actually perform the work. In <a href="#eip">[eip]</a> terms a processor is doing some combination of data routing, transformation, or mediation between systems. Processors have access to attributes of a given FlowFile and its content stream. 
Processors can operate on zero or more FlowFiles in a given unit of work @@ -594,7 +594,7 @@ composition of other components.</p></td </tbody> </table> <div class="paragraph"> -<p>This design model, also similar to <a href="overview.html#seda">[seda]</a>, provides many beneficial consequences that help NiFi +<p>This design model, also similar to <a href="#seda">[seda]</a>, provides many beneficial consequences that help NiFi to be a very effective platform for building powerful and scalable dataflows. A few of these benefits include:</p> </div> @@ -626,11 +626,11 @@ A few of these benefits include:</p> </div> </div> <div class="sect1"> -<h2 id="nifi-architecture"><a class="anchor" href="overview.html#nifi-architecture"></a>NiFi Architecture</h2> +<h2 id="nifi-architecture"><a class="anchor" href="#nifi-architecture"></a>NiFi Architecture</h2> <div class="sectionbody"> <div class="imageblock"> <div class="content"> -<img src="images/zero-master-node.png" alt="NiFi Architecture Diagram"> +<img src="./images/zero-master-node.png" alt="NiFi Architecture Diagram"> </div> </div> <div class="paragraph"> @@ -670,7 +670,7 @@ components of NiFi on the JVM are as fol </div> <div class="imageblock"> <div class="content"> -<img src="images/zero-master-cluster.png" alt="NiFi Cluster Architecture Diagram"> +<img src="./images/zero-master-cluster.png" alt="NiFi Cluster Architecture Diagram"> </div> </div> <div class="paragraph"> @@ -679,7 +679,7 @@ components of NiFi on the JVM are as fol </div> </div> <div class="sect1"> -<h2 id="performance-expectations-and-characteristics-of-nifi"><a class="anchor" href="overview.html#performance-expectations-and-characteristics-of-nifi"></a>Performance Expectations and Characteristics of NiFi</h2> +<h2 id="performance-expectations-and-characteristics-of-nifi"><a class="anchor" href="#performance-expectations-and-characteristics-of-nifi"></a>Performance Expectations and Characteristics of NiFi</h2> <div class="sectionbody"> <div 
class="paragraph"> <p>NiFi is designed to fully leverage the capabilities of the underlying host system @@ -730,7 +730,7 @@ how well the application runs over time. </div> </div> <div class="sect1"> -<h2 id="high-level-overview-of-key-nifi-features"><a class="anchor" href="overview.html#high-level-overview-of-key-nifi-features"></a>High Level Overview of Key NiFi Features</h2> +<h2 id="high-level-overview-of-key-nifi-features"><a class="anchor" href="#high-level-overview-of-key-nifi-features"></a>High Level Overview of Key NiFi Features</h2> <div class="sectionbody"> <div class="paragraph"> <p>This sections provides a 20,000 foot view of NiFi’s cornerstone fundamentals, so that you can understand the Apache NiFi big picture, and some of its the most interesting features. The key features categories include flow management, ease of use, security, extensible architecture, and flexible scaling model.</p> @@ -881,7 +881,7 @@ about loading, and to exchange data on s </div> </div> <div class="sect1"> -<h2 id="references"><a class="anchor" href="overview.html#references"></a>References</h2> +<h2 id="references"><a class="anchor" href="#references"></a>References</h2> <div class="sectionbody"> <div class="ulist bibliography"> <ul class="bibliography"> @@ -916,7 +916,7 @@ about loading, and to exchange data on s </div> <div id="footer"> <div id="footer-text"> -Last updated 2016-08-29 08:05:45 -04:00 +Last updated 2016-11-26 01:07:10 -05:00 </div> </div> </body>
