[ 
https://issues.apache.org/jira/browse/NIFI-12595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

endzeit updated NIFI-12595:
---------------------------
    Summary: Introduce an "Entity Tracking Strategy" and "Tracking Recently 
Listed" to ListedEntityTracker  (was: Introduce an "Entity Tracking Strategy" 
to ListedEntityTracker)

> Introduce an "Entity Tracking Strategy" and "Tracking Recently Listed" to 
> ListedEntityTracker
> ---------------------------------------------------------------------------------------------
>
>                 Key: NIFI-12595
>                 URL: https://issues.apache.org/jira/browse/NIFI-12595
>             Project: Apache NiFi
>          Issue Type: New Feature
>    Affects Versions: 1.24.0
>            Reporter: endzeit
>            Priority: Major
>
> h1. Situation
> The existing {{ListX}} processors support different "Listing Strategies". One 
> commonly used "Listing Strategy" is "Tracking Entities", whereby key 
> information about all recently listed entities, e.g. files, is remembered.
> On every listing, if information about an entity has been remembered 
> before, the entity is not listed again (unless it was modified).
> This has several benefits over other available "Listing Strategies". For 
> example, unlike with "No Tracking", the same entity is not listed repeatedly. 
> Unlike with "Tracking Timestamps", entities with an older timestamp than ones 
> previously listed can be picked up.
> However, the strategy comes with its own problems.
> h1. Problem
> Because available memory and performance are always limited, entities 
> cannot be tracked indefinitely.
> That's why the {{ListedEntityTracker}}, used for implementing "Tracking 
> Entities" by most processors, introduces the notion of an "Entity Tracking 
> Time Window".
> All remembered entities that are outside the time window (they are older than 
> the current time minus the time window) are removed from the tracking cache 
> to limit memory use. Additionally, not yet listed entities that are outside 
> the time window are excluded from listing, as they would be removed from the 
> "cache" immediately on the next run, resulting in them being listed over and 
> over.
> However, this results in entities "older" than the specified "Entity Tracking 
> Time Window" not being picked up. For example, suppose entities are listed 
> from a remote server and this server is unavailable for some time. Once the 
> server is available again, the listing continues. However, all entities / 
> files that were created before the start of the time window will be silently 
> ignored.
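
The time window behavior described above can be sketched roughly as follows. This is a simplified, hypothetical model, not the actual {{ListedEntityTracker}} code: the class name, the method signature, and the "modified means newer timestamp" rule are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the "Tracking Time Window" behavior.
final class TimeWindowTrackerSketch {
    private final long windowMillis;
    // entity identifier -> timestamp remembered at the last listing
    private final Map<String, Long> tracked = new HashMap<>();

    TimeWindowTrackerSketch(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns the identifiers that would be listed at time nowMillis. */
    List<String> list(Map<String, Long> source, long nowMillis) {
        final long minTimestampToList = nowMillis - windowMillis;

        // Forget remembered entities that have left the time window.
        tracked.values().removeIf(seen -> seen < minTimestampToList);

        List<String> listed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : source.entrySet()) {
            long timestamp = entry.getValue();
            if (timestamp < minTimestampToList) {
                continue; // outside the window: silently skipped
            }
            Long known = tracked.get(entry.getKey());
            if (known == null || known < timestamp) { // new or modified
                tracked.put(entry.getKey(), timestamp);
                listed.add(entry.getKey());
            }
        }
        return listed;
    }
}
```

In this sketch, an entity whose timestamp is older than the window at the time the source becomes reachable again is never returned by `list`, which mirrors the outage scenario described above.
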
> As of now, this can be solved by manual intervention, i.e. restarting the 
> {{ListX}} processor: the "Entity Tracking Time Window" is ignored on the 
> initial listing when the "Entity Tracking Initial Listing Target" is set to 
> "All Available" (default).
> However, this requires the NiFi user to be aware of old entities lingering 
> on the connected remote source. Additionally, the need for manual 
> intervention might be undesirable / impractical when a large number of 
> sources are connected.
> Alternatively, the "Entity Tracking Time Window" can be increased to account 
> for longer time frames. However, this only mitigates the problem and does 
> not solve it. There is also a limit to this, as it increases the memory 
> needed.
> h1. Proposal
> This issue proposes introducing the notion of an "Entity Tracking Strategy", 
> whereby the current behavior could be understood as "Tracking Time Window".
> A new strategy, "Tracking Recently Listed", is added. Unlike the existing 
> "Tracking Time Window" strategy, this would not impose any prerequisites on 
> the entities regarding their timestamp (see {{minTimestampToList}}). 
> Instead, all entities would be considered.
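
One way to picture the difference between the two strategies is as a choice of lower timestamp bound. The enum below is only an illustrative sketch: the enum name, its constants, and the method signature are hypothetical and not taken from the NiFi code base. Only the name {{minTimestampToList}} comes from the description above.

```java
// Hypothetical sketch: how the lower timestamp bound could depend on the strategy.
enum EntityTrackingStrategySketch {
    TRACKING_TIME_WINDOW,
    TRACKING_RECENTLY_LISTED;

    /** Entities with a timestamp below the returned value would not be listed. */
    long minTimestampToList(long nowMillis, long windowMillis) {
        if (this == TRACKING_TIME_WINDOW) {
            return nowMillis - windowMillis; // current behavior
        }
        return Long.MIN_VALUE; // no lower bound: all entities are considered
    }
}
```
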
> However, this strategy needs a way to limit / clean the entity cache as well. 
> Instead of removing entries that leave the time window, the strategy should 
> remember only the last N listed entities. That is, every time an entity is 
> listed, it is moved to the front of an "ordered list". If the entity has 
> been listed before, its entry in the "list" is moved to the front 
> nonetheless. After every listing, at most the first N entries are kept. 
> All other, less recently listed entities are removed from the cache.
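
The "ordered list" with a cap of N entries could be sketched with an access-ordered `java.util.LinkedHashMap`, as below. This is a minimal illustration under assumptions, not a proposed implementation: the class and method names are hypothetical, an entity counts as modified only if its timestamp is newer, and the "front" of the list corresponds to the tail of the map's iteration order.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a "Tracking Recently Listed" cache.
final class RecentlyListedTrackerSketch {
    private final int maxEntries;
    // accessOrder = true: iteration runs from least to most recently touched.
    private final LinkedHashMap<String, Long> tracked =
            new LinkedHashMap<>(16, 0.75f, true);

    RecentlyListedTrackerSketch(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    /** Returns the identifiers that are new or modified; no timestamp bound applies. */
    List<String> list(Map<String, Long> source) {
        List<String> listed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : source.entrySet()) {
            // In access-order mode, get() already marks a known entry as most recent.
            Long known = tracked.get(entry.getKey());
            if (known == null || known < entry.getValue()) {
                tracked.put(entry.getKey(), entry.getValue());
                listed.add(entry.getKey());
            }
        }
        // After the listing, keep only the maxEntries most recently touched entries.
        Iterator<String> oldestFirst = tracked.keySet().iterator();
        while (tracked.size() > maxEntries && oldestFirst.hasNext()) {
            oldestFirst.next();
            oldestFirst.remove();
        }
        return listed;
    }

    /** Tracked identifiers, least recently touched first. */
    List<String> trackedIds() {
        return new ArrayList<>(tracked.keySet());
    }
}
```

Trimming only after the whole listing has been processed, rather than on every insert, follows the wording "after every listing" above.
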
> While this strategy solves the problem of listing "old" entities, it comes 
> with its own downsides. The NiFi user has to configure a sensible value for 
> the maximum number N of cache entries. Failing to do so can result in listing 
> an entity more than once (similar to "No Tracking") when the source provides 
> more than N entities at once (e.g. due to a load peak) and the entities are 
> not removed from the source until the next listing. Thus this approach works 
> best for flows where the entity is removed from the source shortly after the 
> listing. This behavior should be mentioned in the documentation of the 
> property.
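
The duplicate-listing downside can be illustrated with a small simulation. It is a sketch under assumptions: entities are modeled as plain integers, the source is unchanged between two runs and always returns entities in the same order, and all names are hypothetical.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical simulation: what happens when the source holds more than N entities.
final class CacheSizeDemoSketch {

    /** Counts entities that are listed again on a second run over an unchanged source. */
    static int relistedOnSecondRun(int sourceSize, int maxEntries) {
        Set<Integer> cache = new LinkedHashSet<>(); // iteration order: oldest first
        runListing(sourceSize, maxEntries, cache);        // first run lists everything
        return runListing(sourceSize, maxEntries, cache); // second run
    }

    private static int runListing(int sourceSize, int maxEntries, Set<Integer> cache) {
        int listed = 0;
        for (int id = 0; id < sourceSize; id++) {
            if (!cache.remove(id)) { // not remembered: it gets listed
                listed++;
            }
            cache.add(id); // (re-)adding marks the entity as most recent
        }
        // Keep only the maxEntries most recently seen entities.
        Iterator<Integer> oldestFirst = cache.iterator();
        while (cache.size() > maxEntries && oldestFirst.hasNext()) {
            oldestFirst.next();
            oldestFirst.remove();
        }
        return listed;
    }
}
```

With these assumptions, `relistedOnSecondRun(5, 3)` returns 2, because the two entities evicted after the first run are listed again, while `relistedOnSecondRun(3, 3)` returns 0.
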



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
