endzeit created NIFI-12595:
------------------------------
Summary: Introduce an "Entity Tracking Strategy" to
ListedEntityTracker
Key: NIFI-12595
URL: https://issues.apache.org/jira/browse/NIFI-12595
Project: Apache NiFi
Issue Type: New Feature
Affects Versions: 1.24.0
Reporter: endzeit
h1. Situation
The existing {{ListX}} processors support different "Listing Strategies". One
commonly used "Listing Strategy" is "Tracking Entities", whereby key
information about all recently listed entities, e.g. files, is remembered.
On every listing, an entity for which information has already been remembered
is not listed again (unless it was modified).
This has several benefits over other available "Listing Strategies". For
example, unlike with "No Tracking" the same entity is not listed repeatedly.
Unlike with "Tracking Timestamps", entities with an older timestamp than ones
previously listed can still be picked up.
However, the strategy comes with its own problems.
h1. Problem
Because available memory and performance are always constrained, entities
cannot be tracked indefinitely.
That's why the {{ListedEntityTracker}}, used for implementing "Tracking
Entities" by most processors, introduces the notion of an "Entity Tracking Time
Window".
All remembered entities that are out of the time window (they are older than
the current time minus the time window) are removed from the tracking cache, to
limit memory use. Additionally, not yet listed entities that are out of the
time window are excluded from listing, as they would otherwise be removed from
the "cache" immediately on the next run, resulting in them being listed over
and over.
However, this results in entities "older" than the specified "Entity Tracking
Time Window" not being picked up. For example, suppose entities are listed from
a remote server and this server is unavailable for some time. Once the server
is available again, the listing continues. However, all entities / files that
were created before the defined time window will be silently ignored.
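The effect described above can be illustrated with a small model. This is only a sketch under simplifying assumptions, not the actual {{ListedEntityTracker}} code: the class {{TimeWindowTrackerSketch}} and its {{list}} method are hypothetical names, entities are reduced to an identifier plus a timestamp, and only {{minTimestampToList}} is taken from the existing implementation's terminology.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of "Tracking Time Window"; not NiFi code. */
final class TimeWindowTrackerSketch {
    private final long windowMillis;
    // identifier -> timestamp of the entity when it was last listed
    private final Map<String, Long> tracked = new HashMap<>();

    TimeWindowTrackerSketch(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns the identifiers listed for the given entities (identifier -> timestamp). */
    List<String> list(Map<String, Long> entities, long now) {
        long minTimestampToList = now - windowMillis;

        // Forget remembered entities that have left the time window.
        tracked.values().removeIf(timestamp -> timestamp < minTimestampToList);

        List<String> listed = new ArrayList<>();
        for (Map.Entry<String, Long> entity : entities.entrySet()) {
            long timestamp = entity.getValue();
            if (timestamp < minTimestampToList) {
                continue; // out of the window: silently skipped, never listed
            }
            Long known = tracked.get(entity.getKey());
            if (known == null || known < timestamp) { // new or modified entity
                tracked.put(entity.getKey(), timestamp);
                listed.add(entity.getKey());
            }
        }
        return listed;
    }
}
```

In this model, an entity whose timestamp is older than {{now - windowMillis}} at the time of the first successful listing is never returned, which mirrors the outage scenario above.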
As of now, this can be solved by manual intervention: restarting the {{ListX}}
processor. The "Entity Tracking Time Window" can be ignored upon initial
listing when the "Entity Tracking Initial Listing Target" is set to "All
Available" (default).
However, this requires the NiFi user to be aware of old entities lingering on
the connected remote source. Additionally, the need for manual intervention
might be undesirable / impractical when a large number of sources are
connected.
Alternatively, the "Entity Tracking Time Window" can be increased to account
for longer time frames. However, this only mitigates the problem and does not
solve it. There is also a limit to this, as it increases the memory needed.
h1. Proposal
This issue proposes introducing the notion of an "Entity Tracking Strategy",
whereby the current behavior could be understood as "Tracking Time Window".
A new strategy of "Tracking Recently Listed" is added. Unlike the
existing "Tracking Time Window" strategy, this would not impose any
prerequisites on the entities regarding their timestamp (see
{{minTimestampToList}}). Instead, all entities would be considered.
However, this strategy needs a way to limit / clean the entity cache as well.
Instead of removing entries that leave the time window, the strategy should
remember only the last N listed entities. That is, every time an entity is
listed, it is moved to the front of an "ordered list". If the entity has
been listed before, its existing entry in the "list" is moved to the front as
well. After every listing, only the first N entries (at most) are kept. All
other, less recently listed entities are removed from the cache.
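One possible shape of such a cache is sketched below. It is an illustration only, not a proposed patch: {{RecentlyListedTrackerSketch}} and its members are hypothetical names, entities are reduced to an identifier plus a timestamp, and the "ordered list" is assumed to be an access-ordered {{java.util.LinkedHashMap}}, where the most recently touched entry sits at the end of the iteration order rather than at the front.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of "Tracking Recently Listed"; not NiFi code. */
final class RecentlyListedTrackerSketch {
    private final int maxEntries; // the N from the proposal
    // accessOrder = true: get() and put() move an entry to the most recent position
    private final LinkedHashMap<String, Long> tracked = new LinkedHashMap<>(16, 0.75f, true);

    RecentlyListedTrackerSketch(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    /** Returns the identifiers listed for the given entities (identifier -> timestamp). */
    List<String> list(Map<String, Long> entities) {
        List<String> listed = new ArrayList<>();
        for (Map.Entry<String, Long> entity : entities.entrySet()) {
            Long known = tracked.get(entity.getKey());
            if (known == null || known < entity.getValue()) { // new or modified entity
                listed.add(entity.getKey());
            }
            // Insert the entry, or refresh its position if it was already tracked.
            tracked.put(entity.getKey(), entity.getValue());
        }

        // After the listing, drop the least recently seen entries beyond N.
        Iterator<String> leastRecentFirst = tracked.keySet().iterator();
        while (tracked.size() > maxEntries && leastRecentFirst.hasNext()) {
            leastRecentFirst.next();
            leastRecentFirst.remove();
        }
        return listed;
    }
}
```

Note that no timestamp check against a time window takes place here; the only bound on the cache is the entry count.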
While this strategy solves the problem of listing "old" entities, it comes with
its own downsides. The NiFi user has to configure a sensible value for N, the
maximum number of cache entries. Failing to do so can result in listing an
entity more than once (similar to "No Tracking") when the source provides more
than N entities at once (e.g. due to a load peak) and the entities are not
removed from the source before the next listing. Thus, this approach works best
for flows where the entity is removed from the source shortly after the listing.
This behavior should be mentioned in the documentation of the property.
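To make the sizing concern concrete, the following small simulation counts how many entities each run would list when the source content never changes. It is a sketch under assumptions: {{CacheSizingDemo}} and {{simulate}} are hypothetical names, modification timestamps are ignored, and the cache is modelled as an access-ordered {{java.util.LinkedHashMap}} that is trimmed to N entries after every listing.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;

/** Hypothetical simulation of repeated listings with a bounded cache; not NiFi code. */
final class CacheSizingDemo {
    /** Lists the same source 'runs' times and returns the number of entities listed per run. */
    static List<Integer> simulate(List<String> source, int maxEntries, int runs) {
        // accessOrder = true: put() on an existing key refreshes its position
        LinkedHashMap<String, Boolean> cache = new LinkedHashMap<>(16, 0.75f, true);
        List<Integer> listedPerRun = new ArrayList<>();
        for (int run = 0; run < runs; run++) {
            int listed = 0;
            for (String id : source) {
                if (cache.put(id, Boolean.TRUE) == null) {
                    listed++; // not in the cache, so the entity is listed
                }
            }
            // Keep only the N most recently seen entries.
            Iterator<String> leastRecentFirst = cache.keySet().iterator();
            while (cache.size() > maxEntries) {
                leastRecentFirst.next();
                leastRecentFirst.remove();
            }
            listedPerRun.add(listed);
        }
        return listedPerRun;
    }
}
```

In this model, three lingering entities with N = 2 yield 3, 1, 1 listed entities over three runs, i.e. one entity is listed again on every run, while N = 3 yields 3, 0, 0.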
--
This message was sent by Atlassian Jira
(v8.20.10#820010)