[
https://issues.apache.org/jira/browse/NIFI-12595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
endzeit updated NIFI-12595:
---------------------------
Description:
h1. Situation
The existing {{ListX}} processors support different "Listing Strategies". One
commonly used "Listing Strategy" is "Tracking Entities", whereby key
information about all recently listed entities, e.g. files, is remembered.
On every listing, if information about an entity has already been remembered,
the entity is not listed again (unless it was modified).
This has several benefits over other available "Listing Strategies". For
example, unlike with "No Tracking" the same entity is not listed repeatedly.
Unlike "Tracking Timestamps", entities with an older timestamp than ones
previously listed can still be picked up.
However, the strategy comes with its own problems.
h1. Problem
Because memory and performance are always constrained, entities cannot be
tracked indefinitely.
That's why the {{ListedEntityTracker}}, used for implementing "Tracking
Entities" by most processors, introduces the notion of an "Entity Tracking Time
Window".
All remembered entities that are outside the time window (that is, older than
the current time minus the time window) are removed from the tracking cache to
limit memory use. Additionally, not-yet-listed entities that are outside the
time window are excluded from listing, as they would be removed from the cache
immediately on the next run, resulting in them being listed over and over.
However, this results in entities older than the specified "Entity Tracking
Time Window" not being picked up. For example, suppose entities are listed from
a remote server and this server is unavailable for some time. Once the server
is available again, the listing continues. However, all entities / files that
were created before the start of the defined time window will be silently
ignored.
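The behavior described above can be illustrated with a minimal sketch. It is a simplified, hypothetical model and not the actual {{ListedEntityTracker}} code: the class name {{TimestampWindowSketch}}, the {{track}} method and the plain in-memory map are assumptions made for illustration only. Only {{minTimestampToList}} is a name taken from this issue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the behavior described above.
// It is NOT the actual ListedEntityTracker implementation.
final class TimestampWindowSketch {

    // identifier -> entity timestamp (epoch millis) of remembered entities
    final Map<String, Long> cache = new HashMap<>();
    final long windowMillis;

    TimestampWindowSketch(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns the identifiers that would be listed in this run.
    List<String> track(Map<String, Long> remoteEntities, long nowMillis) {
        final long minTimestampToList = nowMillis - windowMillis;
        final List<String> listed = new ArrayList<>();

        for (Map.Entry<String, Long> entity : remoteEntities.entrySet()) {
            final long timestamp = entity.getValue();
            if (timestamp < minTimestampToList) {
                continue; // outside the time window: silently ignored
            }
            final Long known = cache.get(entity.getKey());
            if (known == null || known.longValue() != timestamp) {
                listed.add(entity.getKey()); // new or modified entity
                cache.put(entity.getKey(), timestamp);
            }
        }

        // Evict remembered entities whose timestamp fell out of the window.
        cache.values().removeIf(ts -> ts < minTimestampToList);
        return listed;
    }
}
```

With a one-hour window, an entity whose timestamp is two hours old is never returned by {{track}}, no matter how often the listing runs. This is the situation after a longer outage of the remote server.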
As of now, this can be solved by manual intervention, i.e. restarting the ListX
processor. The "Entity Tracking Time Window" can be ignored upon initial
listing when the "Entity Tracking Initial Listing Target" is set to
"All Available" (default).
However, this requires the NiFi user to be aware of old entities lingering on
the connected remote source. Additionally, the need for manual intervention
may be undesirable / impractical when a large number of sources are connected.
Alternatively, the "Entity Tracking Time Window" can be increased to account
for longer time frames. However, this only mitigates the problem and does not
solve it. There is also a limit to this approach, as a larger window increases
the memory needed.
h1. Proposal
This issue proposes introducing the notion of an "Entity Tracking Mode", whereby
the current behavior would be understood as "Track Entity Timestamp".
A new mode, "Track Last Listing Time", is added. Unlike the existing
"Track Entity Timestamp" mode, it would not impose any prerequisites on the
entities regarding their timestamp (see {{minTimestampToList}}). Instead,
all entities would be considered.
However, this mode needs a way to limit / clean the entity cache as well.
Instead of measuring the time window by the timestamp of the entity, the mode
should remember the last time the entity was tracked, that is, the last time it
was part of a call to "listEntities" in "trackEntities". Every time an entity
is listed, its cache entry is renewed. After every listing, only the cache
entries that have been updated within the time window are kept. All other
entries, i.e. entities that have not been listed for longer than the time
window, are removed from the cache.
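The proposed cache handling can be sketched as follows. This is only an illustration under stated assumptions, not the proposed implementation: {{LastListingTimeSketch}}, {{CacheEntry}}, {{track}} and the in-memory map are hypothetical names and structures chosen for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed "Track Last Listing Time" mode.
// Names and structure are assumptions, not the actual NiFi code.
final class LastListingTimeSketch {

    static final class CacheEntry {
        final long entityTimestamp; // used to detect modifications
        final long lastListedAt;    // used to expire the cache entry

        CacheEntry(long entityTimestamp, long lastListedAt) {
            this.entityTimestamp = entityTimestamp;
            this.lastListedAt = lastListedAt;
        }
    }

    final Map<String, CacheEntry> cache = new HashMap<>();
    final long windowMillis;

    LastListingTimeSketch(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns the identifiers that would be listed in this run.
    List<String> track(Map<String, Long> remoteEntities, long nowMillis) {
        final List<String> listed = new ArrayList<>();

        for (Map.Entry<String, Long> entity : remoteEntities.entrySet()) {
            final long timestamp = entity.getValue();
            final CacheEntry known = cache.get(entity.getKey());
            if (known == null || known.entityTimestamp != timestamp) {
                listed.add(entity.getKey()); // new or modified, regardless of age
            }
            // Renew the entry on every listing, whether or not it was emitted.
            cache.put(entity.getKey(), new CacheEntry(timestamp, nowMillis));
        }

        // Keep only entries that were renewed within the time window.
        final long cutoff = nowMillis - windowMillis;
        cache.values().removeIf(entry -> entry.lastListedAt < cutoff);
        return listed;
    }
}
```

In this sketch an entity stays in the cache for as long as it keeps appearing in listings, so it is never emitted twice while it remains on the source. An entry is only dropped once the entity has been absent from listings for longer than the time window.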
If users want to limit a processor to listing only entities up to a certain
age, most processors already support this with a separate property,
e.g. "Maximum File Age" in ListSFTP.
While this mode solves the problem of listing "old" entities, it comes with its
own downsides. Because the restriction on {{minTimestampToList}} is lifted,
more entities can be listed, potentially leading to long listing times.
Additionally, as with the existing "Track Entity Timestamp" mode, there is no
enforced upper limit on the number of cache entries. See NIFI-12609 for
a proposal that may address both problems.
was:
h1. Situation
The existing {{ListX}} processors support different "Listing Strategies". One
commonly used "Listing Strategy" is "Tracking Entities" whereby crucial
information of all recently listed entities, e.g. files, is remembered.
On every listing, in case information to an entity has been remembered before,
the entity is not listed again (unless it was modified).
This has several benefits over other available "Listing Strategies". For
example, unlike with "No Tracking" the same entity is not listed repeatedly.
Other than "Tracking Timestamps", entities with an older timestamp than ones
previously listed can be picked up.
However, the strategy comes with its own problems.
h1. Problem
Due to the ever given constraints to available memory and performance, entities
cannot be tracked indefinitely.
That's why the {{ListedEntityTracker}}, used for implementing "Tracking
Entities" by most processors, introduces the notion of an "Entity Tracking Time
Window".
All remembered entities that are out of the time window (they are older than
the current time minus the time window) are removed from the tracking cache, to
limit memory use. Additionally, not yet listed entities that are out of the
time window are exempt from listing, as they would be removed from the "cache"
on the next run immediately, resulting in them being listed over and over.
However, this results in entities "older" than the specified "Entity Tracking
Time Window" not being picked up. For example, given entities are listed from a
remote server and this server is not available for some time. Once the server
is available again, the listing continues. However, all entities / files that
were created before the defined time window, will be silently ignored.
As of now, this can be solved by manual intervention, re-starting the ListX
processor. The
"Entity Tracking Time Window" can be ignored upon initial listing, when the
"Entity Tracking Initial Listing Target" is set to "All Available" (default).
However, this requires the NiFi user to be aware of lingering old entities
being available on the connected remote source. Additionally, the need for
manual intervention might be undesired / impractical when having a plentiful of
sources connected.
Additionally, the "Entity Tracking Time Window" can be increased to account for
longer time frames. However, this only betters the situation somewhat and does
not solve the problem. Also there is a limit to this, as it increases the
memory needed.
h1. Proposal
This issue proposes introducing the notion of a "Entity Tracking Mode", whereby
the current behavior could be understand as "Track Entity Timestamp".
An new mode of "Track Last Listing Time" is added. Other than the existing
"Track Entity Timestamp" mode, this would not impose any prerequisites on the
entities regarding they timestamp (see {{minTimestampToList}}). Instead, all
entities would be considered.
However, this strategy needs a way to limit / clean the entity cache as well.
Instead of measuring the time window by the timestamp of the entity, the mode
should remember the last time the entity was tracked; that is, part of a call
to "listEntities" in "trackEntities". That is, every time an entity is listed,
its cache entry is renewed. After every listing, only the cache entries that
have been updated in the time window will be kept. All other, entities that
have not been listed for a longer time, are removed from cache.
In case users want to limit a processor to only list entities up to a certain
age, most processors have support for this with a separate property already,
e.g. "Maximum File Age" in ListSFTP.
While this mode solves the problem of listing "old" entities it comes with its
own downsides. Due to lifting the restriction on {{minTimestampToList}}, more
entities can be listed, potentially leading to long listing times.
Additionally, similar to the existing "Track Entity Timestamp" there is no
enforced upper limit on how many cache entries are possible. See NIFI-12609 for
a proposal that may address both problems.
> Introduce an "Entity Tracking Mode" and "Track Last Listing Time" to
> ListedEntityTracker
> ----------------------------------------------------------------------------------------
>
> Key: NIFI-12595
> URL: https://issues.apache.org/jira/browse/NIFI-12595
> Project: Apache NiFi
> Issue Type: New Feature
> Affects Versions: 1.24.0
> Reporter: endzeit
> Assignee: endzeit
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
--
This message was sent by Atlassian Jira
(v8.20.10#820010)