[ 
https://issues.apache.org/jira/browse/NIFI-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140209#comment-15140209
 ] 

Joseph Witt commented on NIFI-1484:
-----------------------------------

Aldrin and I have discussed an alternative approach that captures two 
timestamps to track how far we've gotten and whether we're on a full time 
boundary, so we can pull without risk of missing things that come in at the 
same time.  This processor already pulls anything that arrives 'after' the 
last time we pulled, as tracked by the modification time of the files.  The 
proposed concept is to track both the time of the data we pulled last and the 
time of data we saw but did not pull because we cannot yet be sure of its time 
bound.  On the next iteration we check whether sufficient time has passed; if 
so we pick that data up, otherwise we run the logic again.  This minimizes 
false delay, honors the timing issue Mark raised, keeps the shared state 
requirement very small, and still requires only a single scan/pass to get a 
listing.  I believe Aldrin is working on a PR for this now.
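
To make the two-timestamp idea concrete, here is a minimal sketch in Python 
of one possible reading of it.  It is not the ListFile code and not 
necessarily what the PR will do.  The names ListingState, list_once, 
last_emitted_ms, held_back_ms, and settle_ms are all hypothetical, and the 
100 ms settle window is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ListingState:
    """Hypothetical shared state: just two timestamps, no filenames."""
    # Newest modification time (ms) among files already emitted.
    last_emitted_ms: int = -1
    # Newest modification time (ms) seen but deferred, or None if nothing is held.
    held_back_ms: Optional[int] = None


def list_once(
    entries: Iterable[Tuple[str, int]],
    state: ListingState,
    now_ms: int,
    settle_ms: int = 100,  # assumed window; not a value from the issue
) -> Tuple[List[str], ListingState]:
    """Run one listing pass over (name, mtime_ms) pairs.

    Emits files newer than the last emitted timestamp whose modification
    time is at least settle_ms in the past, and defers the rest.
    """
    cutoff = now_ms - settle_ms
    # Single pass over the listing: keep only files not yet emitted.
    fresh = [(name, mtime) for name, mtime in entries
             if mtime > state.last_emitted_ms]
    # Old enough that more files with the same timestamp are unlikely to appear.
    ready = [(name, mtime) for name, mtime in fresh if mtime <= cutoff]
    # Too recent to trust the time bound; revisit on the next iteration.
    deferred = [mtime for _, mtime in fresh if mtime > cutoff]

    last_emitted = max((m for _, m in ready), default=state.last_emitted_ms)
    held_back = max(deferred, default=None)
    return sorted(name for name, _ in ready), ListingState(last_emitted, held_back)
```

In this sketch a file that shares the newest timestamp is held for at most 
one settle window, and the persisted state stays at two numbers regardless of 
how many files share that timestamp.  It does assume that file modification 
times and the processor's clock are comparable.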

> ListFile holds unbounded list of files with matching time stamps
> ----------------------------------------------------------------
>
>                 Key: NIFI-1484
>                 URL: https://issues.apache.org/jira/browse/NIFI-1484
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core UI, Extensions
>    Affects Versions: 0.4.0, 0.5.0
>            Reporter: Joseph Witt
>             Fix For: 0.5.0
>
>
> ListFile appears to hold an unbounded set of filenames that match the last 
> timestamp.  While this is understandable as a way to handle the edge case of 
> new data arriving at the same time, it presents two problems.  First, we hold 
> all of this information in state management, which could put considerable 
> pressure on both the local and remote stores, and we also hold it in memory 
> before we persist it.
> Also, the entire state listing appears to show up in the UI without 
> pagination or any limit on number of entries.  This seems like a problem for 
> the client-side as well.  The server side should probably restrict this.
> Finally, it seems like the need for saving filenames seen at a given 
> timestamp is only necessary if we're assuming the listing we do is 'as-of' 
> RIGHT NOW.  What if instead we did the listing based on a last modified time 
> of 'RIGHTNOW'-1 millisecond or something like that?  Then we should not have 
> to worry at all about keeping a listing of names for the timestamp.
> The reason I think this is important is that it is not at all uncommon for a 
> directory with large quantities of files to have data at the same time due to 
> a copy operation not preserving original file attributes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
