Alan Jackoway created NIFI-2705:
-----------------------------------

             Summary: ListHDFS Cannot Be Re-run
                 Key: NIFI-2705
                 URL: https://issues.apache.org/jira/browse/NIFI-2705
             Project: Apache NiFi
          Issue Type: Bug
          Components: Core Framework, Documentation & Website
    Affects Versions: 1.0.0
            Reporter: Alan Jackoway


I have a use case where, every day, I want to go through a directory in HDFS 
and do something with the files that are more than a month old.

I was trying to do this with a flow like ListHDFS -> RouteOnAttribute 
(hdfs.lastModified) -> FetchHDFS -> Processing.

However, after I ran it once, the old files were no longer pulled on later 
runs. I turned on debug logging and got this:

{noformat}
2016-08-30 06:15:17,473 DEBUG [Timer-Driven Process Thread-9] o.apache.nifi.processors.hadoop.ListHDFS ListHDFS[id=d80a1ceb-0156-1000-595d-978dcf53ecb6] Found a total of 3 files in HDFS
2016-08-30 06:15:17,473 DEBUG [Timer-Driven Process Thread-9] o.apache.nifi.processors.hadoop.ListHDFS ListHDFS[id=d80a1ceb-0156-1000-595d-978dcf53ecb6] Of the 3 files found in HDFS, 0 are listable
2016-08-30 06:15:17,473 DEBUG [Timer-Driven Process Thread-9] o.apache.nifi.processors.hadoop.ListHDFS ListHDFS[id=d80a1ceb-0156-1000-595d-978dcf53ecb6] There is no data to list. Yielding.
{noformat}

It turns out that ListHDFS maintains a state value called 
{{latestTimestampListed}} that prevents it from re-listing files unless you 
change the directory being listed. At a minimum, this should be mentioned in 
the ListHDFS documentation. Better still would be to make the behavior 
configurable, more like GetHDFS.

In my case I think I can switch to GetHDFS without causing trouble, but the 
behavior of ListHDFS was surprising to me and, as far as I can tell, is not 
documented anywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
