[ 
https://issues.apache.org/jira/browse/NIFI-12595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

endzeit updated NIFI-12595:
---------------------------
    Description: 
h1. Situation

The existing {{ListX}} processors support different "Listing Strategies". One
commonly used "Listing Strategy" is "Tracking Entities", whereby key
information about all recently listed entities, e.g. files, is remembered.
On every listing, if information about an entity has already been remembered,
the entity is not listed again (unless it was modified).

This has several benefits over the other available "Listing Strategies". For
example, unlike with "No Tracking", the same entity is not listed repeatedly.
Unlike with "Tracking Timestamps", entities with an older timestamp than ones
previously listed can still be picked up.
However, the strategy comes with its own problems.

h1. Problem

Because memory and performance are limited, entities cannot be tracked
indefinitely.
That's why the {{ListedEntityTracker}}, used by most processors to implement
"Tracking Entities", introduces the notion of an "Entity Tracking Time
Window".
All remembered entities that are outside the time window (that is, older than
the current time minus the time window) are removed from the tracking cache to
limit memory use. Additionally, not yet listed entities that are outside the
time window are excluded from listing, as they would be removed from the cache
immediately on the next run, resulting in them being listed over and over.
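
For illustration, the time-window behavior described above can be sketched
roughly as follows. This is a simplified sketch, not the actual
{{ListedEntityTracker}} code: the class name, the method signature and the
plain map used as cache are assumptions, and "modified" is approximated by a
newer timestamp.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and structure are assumptions for illustration.
class TimeWindowSketch {

    // Cache of already listed entities: identifier -> entity timestamp (epoch millis).
    final Map<String, Long> cache = new HashMap<>();

    // 'found' maps identifier -> entity timestamp as reported by the source.
    // Returns the identifiers that would be listed in this run.
    List<String> list(Map<String, Long> found, long nowMillis, long windowMillis) {
        long minTimestampToList = nowMillis - windowMillis;

        // Remembered entities that have left the time window are dropped.
        cache.values().removeIf(ts -> ts < minTimestampToList);

        List<String> toList = new ArrayList<>();
        for (Map.Entry<String, Long> entry : found.entrySet()) {
            long ts = entry.getValue();
            if (ts < minTimestampToList) {
                continue; // outside the window: skipped without any notice
            }
            Long known = cache.get(entry.getKey());
            if (known == null || known < ts) { // new, or modified since last listing
                cache.put(entry.getKey(), ts);
                toList.add(entry.getKey());
            }
        }
        return toList;
    }
}
```

In this sketch, an entity whose timestamp is older than {{nowMillis - windowMillis}}
is never returned, no matter how often the method is called.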

However, this results in entities "older" than the specified "Entity Tracking
Time Window" not being picked up. For example, suppose entities are listed
from a remote server and this server is unavailable for some time. Once the
server is available again, the listing continues. However, all entities /
files that were created before the start of the defined time window are
silently ignored.

As of now, this can be solved by manual intervention, i.e. restarting the
ListX processor. The "Entity Tracking Time Window" can be ignored upon initial
listing by setting the "Entity Tracking Initial Listing Target" to
"All Available" (the default).

However, this requires the NiFi user to be aware of old entities lingering on
the connected remote source. Additionally, manual intervention might be
undesirable or impractical when many sources are connected.

Alternatively, the "Entity Tracking Time Window" can be increased to account
for longer time frames. However, this only mitigates the problem and does not
solve it. There is also a limit to this, as a larger window increases the
memory needed.

h1. Proposal

This issue proposes introducing the notion of an "Entity Tracking Mode",
whereby the current behavior would be understood as "Track Entity Timestamp".

A new mode, "Track Last Listing Time", is added. Unlike the existing
"Track Entity Timestamp" mode, it would not impose any prerequisites on the
entities regarding their timestamp (see {{minTimestampToList}}). Instead, all
entities would be considered.
However, this mode needs a way to limit / clean the entity cache as well.
Instead of measuring the time window against the timestamp of the entity, the
mode should remember the last time the entity was tracked, that is, the last
time it was part of a call to "listEntities" in "trackEntities". Every time an
entity is listed, its cache entry is renewed. After every listing, only the
cache entries that have been updated within the time window are kept. All
other entries, i.e. entities that have not been listed for a longer time, are
removed from the cache.
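
The proposed mode could be sketched roughly as follows. This is only an
illustration under assumptions: the class name, the method signature and the
map-based cache are hypothetical and not part of the existing
{{ListedEntityTracker}}, and detection of modified entities is left out for
brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a "Track Last Listing Time" mode; all names are assumptions.
class LastListingTimeSketch {

    // identifier -> time the entity was last seen in a listing (epoch millis)
    final Map<String, Long> lastListed = new HashMap<>();

    // 'found' holds the identifiers currently present on the source.
    // Returns the identifiers that would be listed in this run.
    List<String> list(List<String> found, long nowMillis, long windowMillis) {
        List<String> toList = new ArrayList<>();
        for (String id : found) {
            // Renew the cache entry on every listing. put() returns null
            // when the entity was not in the cache, i.e. it is new.
            if (lastListed.put(id, nowMillis) == null) {
                toList.add(id);
            }
        }
        // Keep only entries that were renewed within the time window.
        long oldestAllowed = nowMillis - windowMillis;
        lastListed.values().removeIf(seenAt -> seenAt < oldestAllowed);
        return toList;
    }
}
```

Note that the entity's own timestamp is not consulted at all here, so entities
of any age are considered. An entry is only evicted once the entity has not
appeared in a listing for longer than the time window.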

While this mode solves the problem of listing "old" entities, it comes with
its own downsides. Similar to the existing "Track Entity Timestamp" mode,
there is no upper limit on how many cache entries are possible.
The NiFi user has to configure a sensible value for the maximum number N of
cache entries. Failing to do so can result in an entity being listed more than
once (similar to "No Tracking") when the source provides more than N entities
at once (e.g. due to a load peak) and the entities are not removed from the
source until the next listing. Thus, this approach works best for flows where
the entity is removed from the source shortly after the listing. This behavior
should be mentioned in the documentation of the property.

  was:
h1. Situation

The existing {{ListX}} processors support different "Listing Strategies". One
commonly used "Listing Strategy" is "Tracking Entities", whereby key
information about all recently listed entities, e.g. files, is remembered.
On every listing, if information about an entity has already been remembered,
the entity is not listed again (unless it was modified).

This has several benefits over the other available "Listing Strategies". For
example, unlike with "No Tracking", the same entity is not listed repeatedly.
Unlike with "Tracking Timestamps", entities with an older timestamp than ones
previously listed can still be picked up.
However, the strategy comes with its own problems.

h1. Problem

Because memory and performance are limited, entities cannot be tracked
indefinitely.
That's why the {{ListedEntityTracker}}, used by most processors to implement
"Tracking Entities", introduces the notion of an "Entity Tracking Time
Window".
All remembered entities that are outside the time window (that is, older than
the current time minus the time window) are removed from the tracking cache to
limit memory use. Additionally, not yet listed entities that are outside the
time window are excluded from listing, as they would be removed from the cache
immediately on the next run, resulting in them being listed over and over.

However, this results in entities "older" than the specified "Entity Tracking
Time Window" not being picked up. For example, suppose entities are listed
from a remote server and this server is unavailable for some time. Once the
server is available again, the listing continues. However, all entities /
files that were created before the start of the defined time window are
silently ignored.

As of now, this can be solved by manual intervention, i.e. restarting the
ListX processor. The "Entity Tracking Time Window" can be ignored upon initial
listing by setting the "Entity Tracking Initial Listing Target" to
"All Available" (the default).

However, this requires the NiFi user to be aware of old entities lingering on
the connected remote source. Additionally, manual intervention might be
undesirable or impractical when many sources are connected.

Alternatively, the "Entity Tracking Time Window" can be increased to account
for longer time frames. However, this only mitigates the problem and does not
solve it. There is also a limit to this, as a larger window increases the
memory needed.

h1. Proposal

This issue proposes introducing the notion of an "Entity Tracking Strategy",
whereby the current behavior would be understood as "Tracking Time Window".

A new strategy, "Tracking Recently Listed", is added. Unlike the existing
"Tracking Time Window" strategy, it would not impose any prerequisites on the
entities regarding their timestamp (see {{minTimestampToList}}). Instead, all
entities would be considered.
However, this strategy needs a way to limit / clean the entity cache as well.
Instead of removing entries that leave the time window, the strategy should
remember only the last N listed entities. That is, every time an entity is
listed, it is moved to the front of an "ordered list". If the entity has been
listed before, its existing entry in the "list" is moved to the front.
After every listing, only the first N entries at most are kept. All other,
less recently listed entities are removed from the cache.

While this strategy solves the problem of listing "old" entities, it comes
with its own downsides. The NiFi user has to configure a sensible value for
the maximum number N of cache entries. Failing to do so can result in an
entity being listed more than once (similar to "No Tracking") when the source
provides more than N entities at once (e.g. due to a load peak) and the
entities are not removed from the source until the next listing. Thus, this
approach works best for flows where the entity is removed from the source
shortly after the listing.
This behavior should be mentioned in the documentation of the property.


> Introduce an "Entity Tracking Mode" and "Track Last Listing Time" to 
> ListedEntityTracker
> ----------------------------------------------------------------------------------------
>
>                 Key: NIFI-12595
>                 URL: https://issues.apache.org/jira/browse/NIFI-12595
>             Project: Apache NiFi
>          Issue Type: New Feature
>    Affects Versions: 1.24.0
>            Reporter: endzeit
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
